how many tokens?

David Relson relson at osagesoftware.com
Wed Feb 26 18:43:45 CET 2003


Greetings!

Looking at the HTML sample below, it's not obvious how it should be 
tokenized.  At least, it's not obvious to me.  What do y'all think?

### here's the sample ###

Book®
<a href=http://example.com>Book of the Month Club</a>® for
No Due Dates<FONT COLOR="fef0d0">zzzzzz</FONT>No Hidden Charges<FONT 
COLOR="#fef0d0">zzzzzz</FONT>No


The questions are:

line 1 - Is "Book®" the token or should it be "Book"?

line 2 - Should "Club</a>®" produce "Club®" or "Club"?

line 3 - Should "Dates<FONT COLOR="fef0d0">zzzzzz</FONT>No" produce three 
tokens ("dates", "zzzzzz", "no") or just one, i.e. "dateszzzzzzno"?

At the moment I have several parser variations that give slightly 
different results.  What do _we_ want?
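
To make the choices concrete, here is a rough Python sketch of the two 
knobs in question.  It is not the bogofilter lexer, just an illustration; 
the function name tokenize() and the flags tag_breaks_token and 
keep_symbol are made up for this example, and the regular expressions 
are crude assumptions that only need to cope with the sample above.

```python
import re

# Assumption: a crude tag matcher, good enough for the sample lines only.
TAG = re.compile(r"<[^>]*>")
# Two hypothetical word definitions: one treats the ® sign as part of a token,
# the other keeps only ASCII letters and digits.
WORD_WITH_SYMBOL = re.compile(r"[A-Za-z0-9®]+")
WORD_ASCII = re.compile(r"[A-Za-z0-9]+")


def tokenize(text, tag_breaks_token, keep_symbol):
    """Return lowercase tokens under one combination of the two choices."""
    # Replace each tag with a space (tag ends a token) or with nothing
    # (text on either side of the tag is joined into one token).
    stripped = TAG.sub(" " if tag_breaks_token else "", text)
    pattern = WORD_WITH_SYMBOL if keep_symbol else WORD_ASCII
    return [tok.lower() for tok in pattern.findall(stripped)]


line2 = "Club</a>®"
line3 = 'Dates<FONT COLOR="fef0d0">zzzzzz</FONT>No'

print(tokenize(line3, tag_breaks_token=True, keep_symbol=False))
print(tokenize(line3, tag_breaks_token=False, keep_symbol=False))
print(tokenize(line2, tag_breaks_token=False, keep_symbol=True))
print(tokenize(line2, tag_breaks_token=False, keep_symbol=False))
```

With tags treated as separators, line 3 splits at each tag; with tags 
simply deleted, it collapses into the single run "dateszzzzzzno".  
Likewise, line 2 gives "club®" or "club" depending on whether the ® sign 
counts as a token character.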

David





More information about the bogofilter-dev mailing list