how many tokens?

Nick Simicich njs at scifi.squawk.com
Thu Feb 27 19:03:12 CET 2003


At 12:43 PM 2003-02-26 -0500, David Relson wrote:

>Greeting!
>
>Looking at the html sample below, it's not obvious how it should be 
>tokenized.  At least, it's not obvious to me.  What do y'all think?
>
>### here's the sample ###
>
>Book®
><a href=http://example.com>Book of the Month Club</a>® for
>No Due Dates<FONT COLOR="fef0d0">zzzzzz</FONT>No Hidden Charges<FONT 
>COLOR="#fef0d0">zzzzzz</FONT>No
>
>
>The questions are:
>
>line 1 - Is "Book®" the token or should it be "Book"?

It depends on whether we decide that ® is a character that can be part of 
a token.  I vote no.

>line 2 - Should "Club</a>®" produce "Club®" or "Club"?

</a> does not produce an eyespace break (a visible gap in the rendered 
text), so it should not break the token.

>line 3 - Should "Dates<FONT COLOR="fef0d0">zzzzzz</FONT>No" produce two 
>tokens ("dates", "zzzzzz") or just one, i.e. "dateszzzzzzno" ?

Should "Dat<FONT COLOR="fef0d0">e</FONT>s" produce one token or three?  It 
will be rendered as a single word with an oddly colored e.

FONT does not produce an eyespace break.  How about...

"Dat<FONT COLOR="fef0d0"></FONT>ing"

That should definitely produce only one token, and the answer is the same 
as above: if a tag does not cause an eyespace break, it should not cause a 
token break.
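
A minimal sketch of that rule, in Python rather than bogofilter's own 
lexer.  Everything named here is an assumption made for illustration: the 
BREAKING_TAGS set is a hypothetical list of tags that render with a visible 
gap, tokens are taken to be runs of ASCII letters and digits (so ® is 
excluded, per my vote above), and tokens are lowercased as in David's 
examples.

```python
import re

# Hypothetical set of tags that render with a visible gap ("eyespace").
# Any tag not listed here (A, FONT, B, ...) is treated as non-breaking.
BREAKING_TAGS = {"br", "p", "div", "td", "tr", "li", "hr"}

# Matches an opening or closing tag and captures the tag name.
TAG_RE = re.compile(r"<\s*/?\s*([A-Za-z][A-Za-z0-9]*)[^>]*>")

# Assumption: a token is a run of ASCII letters and digits, so a
# character such as the registered-trademark sign never joins a token.
WORD_RE = re.compile(r"[A-Za-z0-9]+")


def tokenize(html):
    """Split HTML into lowercase tokens, breaking only on eyespace."""
    def replace_tag(match):
        # A breaking tag becomes a space; any other tag simply vanishes,
        # so the text on either side of it is joined into one word.
        return " " if match.group(1).lower() in BREAKING_TAGS else ""

    text = TAG_RE.sub(replace_tag, html)
    return [word.lower() for word in WORD_RE.findall(text)]


print(tokenize('Dat<FONT COLOR="fef0d0">e</FONT>s'))
print(tokenize('Dat<FONT COLOR="fef0d0"></FONT>ing'))
print(tokenize('one<br>two'))
```

Under these assumptions the first two calls each give a single token 
("dates" and "dating"), while the <br> in the third splits "one" from 
"two".  David's line 3 would likewise come out as "dateszzzzzzno".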

--
SPAM: Trademark for spiced, chopped ham manufactured by Hormel.
spam: Unsolicited, Bulk E-mail, where e-mail can be interpreted generally 
to mean electronic messages designed to be read by an individual, and it 
can include Usenet, SMS, AIM, etc.  But if it is not all three of 
Unsolicited, Bulk, and E-mail, it simply is not spam. Misusing the term 
plays into the hands of the spammers, since it causes confusion, and 
spammers thrive on  confusion. Spam is not speech, it is an action, like 
theft, or vandalism. If you were not confused, would you patronize a spammer?
Nick Simicich - njs at scifi.squawk.com - http://scifi.squawk.com/njs.html
Stop by and light up the world!


