html_tokenizer

Nick Simicich njs at scifi.squawk.com
Fri Feb 21 00:27:56 CET 2003


As a side issue in html_tokenizer, it might be reasonable to deal with 
quoted strings inside of tokens.  Specifically, if someone codes something as:

<a href="http://foo.bar.com/whatever.html"> how should that be tokenized?

Should the entire quoted string be one token?

Should a > inside the quoted string be respected?

<a href="http://foo.bar.com/whatever >.html" >  The current code ends the token 
at the space before the first > and the tag at the first >, but perhaps the 
token should end at the closing " and the tag at the second >?

It would be simple to end the quoted string by either (1) looking for pairs 
of quotes in preference to other things when you are in the state that 
indicates that you are in a token that is not a comment, or (2) starting 
another state that is used when a token starts with a " inside an HTML tag.
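
To make option (2) concrete, here is a rough sketch in Python. It is not the 
html_tokenizer code; the function name tokenize_tag and the two state names 
are made up for illustration, and it assumes that only double quotes matter 
and that whitespace and = separate tokens inside a tag.

```python
def tokenize_tag(text, start=0):
    """Scan one HTML tag that begins at text[start] == '<'.

    Returns (tokens, end), where end is the index just past the '>' that
    closes the tag. A double-quoted string becomes a single token, and a
    '>' inside the quotes does not end the tag.
    """
    if text[start] != "<":
        raise ValueError("tokenize_tag must start at a '<'")

    # Hypothetical state names: IN_TAG is the ordinary in-tag state,
    # IN_QUOTE is the extra state entered at an opening double quote.
    IN_TAG, IN_QUOTE = "in_tag", "in_quote"
    state = IN_TAG
    tokens = []
    current = []

    def flush():
        # Emit the pending unquoted token, if there is one.
        if current:
            tokens.append("".join(current))
            current.clear()

    i = start + 1
    while i < len(text):
        ch = text[i]
        if state == IN_QUOTE:
            if ch == '"':
                # Closing quote: the whole quoted string is one token.
                tokens.append("".join(current))
                current.clear()
                state = IN_TAG
            else:
                # Everything else, including '>', stays in the token.
                current.append(ch)
        elif ch == '"':
            flush()
            state = IN_QUOTE
        elif ch == ">":
            flush()
            return tokens, i + 1
        elif ch.isspace() or ch == "=":
            flush()
        else:
            current.append(ch)
        i += 1

    # Unterminated tag or quote: emit whatever was collected.
    flush()
    return tokens, len(text)
```

Called on the second example above, tokenize_tag returns the tokens a, href, 
and the full quoted URL (with its embedded >), and reports the tag as ending 
at the second >. Option (1) would give the same result by searching ahead for 
the matching quote instead of keeping a separate state.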

There are two reasons for doing this "right".  One is that tagging tokens 
as coming from within HTML tags may be important to telling spam from 
non-spam.  HTML with table tags might be spam, while a question about how 
<td> is used, in a plain text section, might not be.

The other is that this might be a simple way for spammers to make 
things seem less spammy.
Does anyone have any spam-in-the-wild cases of people protecting their spam 
words using any technique like this?

--
SPAM: Trademark for spiced, chopped ham manufactured by Hormel.
spam: Unsolicited, Bulk E-mail, where e-mail can be interpreted generally 
to mean electronic messages designed to be read by an individual, and it 
can include Usenet, SMS, AIM, etc.  But if it is not all three of 
Unsolicited, Bulk, and E-mail, it simply is not spam. Misusing the term 
plays into the hands of the spammers, since it causes confusion, and 
spammers thrive on confusion. Spam is not speech, it is an action, like 
theft, or vandalism. If you were not confused, would you patronize a spammer?
Nick Simicich - njs at scifi.squawk.com - http://scifi.squawk.com/njs.html
Stop by and light up the world!



More information about the Bogofilter mailing list