how many tokens?
David Relson
relson at osagesoftware.com
Wed Feb 26 18:43:45 CET 2003
Greetings!
Looking at the HTML sample below, it's not obvious how it should be
tokenized. At least, it's not obvious to me. What do y'all think?
### here's the sample ###
Book®
<a href=http://example.com>Book of the Month Club</a>® for
No Due Dates<FONT COLOR="fef0d0">zzzzzz</FONT>No Hidden Charges<FONT
COLOR="#fef0d0">zzzzzz</FONT>No
The questions are:
line 1 - Is "Book®" the token or should it be "Book"?
line 2 - Should "Club</a>®" produce "Club®" or "Club"?
line 3 - Should "Dates<FONT COLOR="fef0d0">zzzzzz</FONT>No" produce two
tokens ("dates", "zzzzzz") or just one, i.e. "dateszzzzzzno"?
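To make the choices concrete, here is a minimal Python sketch of the two
decisions involved. It is not bogofilter's parser, only an illustration. The
function name tokenize and the flags join_across_tags and keep_symbols are
hypothetical, and the rule that a token is a run of letters (optionally
including "®") is an assumption made for the example.

```python
import re

# Assumption: anything between "<" and the next ">" is a tag.
TAG = re.compile(r"<[^>]*>")

def tokenize(text, join_across_tags, keep_symbols):
    """Return lowercased tokens from an HTML fragment (illustrative only)."""
    # Deleting a tag joins the text on either side of it;
    # replacing it with a space forces a token break there.
    text = TAG.sub("" if join_across_tags else " ", text)
    # A token is a run of letters; optionally "®" counts as part of a token.
    pattern = r"(?:[^\W\d_]|®)+" if keep_symbols else r"[^\W\d_]+"
    return re.findall(pattern, text.lower())

line3 = 'No Due Dates<FONT COLOR="fef0d0">zzzzzz</FONT>No Hidden Charges'
print(tokenize(line3, join_across_tags=True, keep_symbols=False))
print(tokenize(line3, join_across_tags=False, keep_symbols=False))
print(tokenize("Club</a>®", join_across_tags=True, keep_symbols=True))
```

With join_across_tags=True, line 3 yields the single run "dateszzzzzzno".
With it set to False, "dates", "zzzzzz" and "no" come out as separate tokens.
The last call shows "Club</a>®" collapsing to "club®" when tags are deleted
and the symbol is kept.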
At the moment I have several parser variations that give slightly
different results. What do _we_ want?
David