html_tokenizer

Fri Feb 21 00:58:43 CET 2003

At 06:27 PM 2/20/03, Nick Simicich wrote:
>As a side issue in html_tokenizer, it might be reasonable to deal with 
>quoted strings inside of tokens.  Specifically, if someone codes something as:
>
><a href="http://foo.bar.com/whatever.html"> how should that be tokenized?
>
>Should the entire quoted string be one token?

The subject of what to do with tokens inside of html tags is, at best, very 
unclear.  Bogofilter needs tokens that can be matched between 
messages.  Are symbolic URLs matchable?  Some parts may be and some may 
not.  For example, "http://foo.bar.com/cgi-bin/whatever?asdf&qwerty&uiop" 
is a legitimate URL, even though its last three tokens (asdf, qwerty, uiop) 
may be totally bogus.

It's worth an experiment to modify the parser to keep the innards of _real_ 
tags; maybe even use an "href:" prefix (or other suitable 
string(s)).  Given the modified parser, we can run experiments to see if 
the mods help bogofilter, or not.
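
To make the "href:" idea concrete, here is a rough sketch in Python (not 
bogofilter's actual C lexer).  The function name url_tokens, the default 
prefix, and the choice of split characters are all assumptions made for 
illustration; which split rule is best is exactly what the experiments 
would have to show.

```python
import re

# Hypothetical helper: split a URL taken from an href attribute into
# tokens and tag each one with a prefix, so that words seen inside tags
# are counted separately from the same words seen in the message body.
def url_tokens(url, prefix="href:"):
    # Split on anything that is not a letter, digit, '.' or '-'.
    # This keeps a host name such as foo.bar.com together as one token.
    parts = re.split(r"[^A-Za-z0-9.-]+", url)
    return [prefix + part for part in parts if part]

if __name__ == "__main__":
    for token in url_tokens(
        "http://foo.bar.com/cgi-bin/whatever?asdf&qwerty&uiop"
    ):
        print(token)
```

With this rule the host name is likely to recur across messages from the 
same sender, while the query-string pieces may be throwaway values.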

>Should a > inside the quoted string be respected?
>
><a href="http://foo.bar.com/whatever >.html" >  Current code ends the 
>token at the  before the first  > and the token at the first >, but 
>perhaps it should be ended at the " and the token at the second >?
>
>It would be simple to end the quoted string by either (1) looking for 
>pairs of quotes in preference to other things when you are in the state 
>that indicates that you are in a token that is not a comment or...
>
>starting another state that is used when you have a token that starts with 
>a " when you are inside a html tag.
>
>There are two reasons for doing this "right".  One is that tagging tokens 
>as from within html tags may be important to telling spam from 
>non-spam.  HTML with table tags might be spam, while a question about how 
>is <td> used, in a plain text section, might not be spam.

Code away so we can do some tests and see what is useful and what is 
not.  For sure, include a way to enable/disable features so their 
usefulness can be tested.
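
As a starting point, the quote handling could be sketched like this in 
Python (an illustration only, not the current lexer code).  The name 
find_tag_end and the respect_quotes switch are hypothetical; the switch 
stands in for whatever enable/disable mechanism the real parser would use.

```python
def find_tag_end(text, start, respect_quotes=True):
    """Return the index of the '>' that closes the tag opened at
    text[start], or -1 if no closing '>' is found.

    With respect_quotes=False the first '>' ends the tag.  With
    respect_quotes=True a '>' inside a quoted attribute value is
    skipped, so the tag ends at the first '>' outside quotes.
    """
    quote = None  # the quote character we are inside of, if any
    for i in range(start + 1, len(text)):
        ch = text[i]
        if quote is not None:
            if ch == quote:      # matching quote closes the string
                quote = None
        elif respect_quotes and ch in "\"'":
            quote = ch           # entering a quoted string
        elif ch == ">":
            return i
    return -1

if __name__ == "__main__":
    tag = '<a href="http://foo.bar.com/whatever >.html" >'
    print(find_tag_end(tag, 0, respect_quotes=False))
    print(find_tag_end(tag, 0, respect_quotes=True))
```

Running both settings over the same corpus would show whether the 
quote-aware behavior changes the classification results at all.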

>the other might be that this might be a simple way for spammers to make 
>things seem less spammy.
>Does anyone have any spam-in-the-wild cases of people protecting their 
>spam words using any technique like this?

Perhaps not relevant, but recently I've noticed a lot of garbage strings 
inside of spam.  They often look like character sequences straight from the 
keyboard, i.e. "qwertyuiop", "asdf", etc.  Remembering that I had noticed 
"asdf" in one spam, I ran "grep -c asdf spam.Feb.2003/*" and found that 85 
of the 3563 spam messages I've received this month contain that particular 
string.
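
For anyone who wants to repeat that count on their own corpus, here is a 
small Python sketch.  It assumes one message per file in a directory, as 
the spam.Feb.2003/* glob suggests; the function name is made up for this 
example.  Note that grep -c prints a matching-line count for each file, so 
getting a single number of messages takes one more step, which this does 
directly.

```python
from pathlib import Path

def count_messages_containing(directory, needle):
    """Return (matching, total): how many regular files in directory
    contain needle, and how many regular files there are in all."""
    needle_bytes = needle.encode()
    matching = total = 0
    for path in Path(directory).iterdir():
        if not path.is_file():
            continue
        total += 1
        # Read as bytes so odd character sets in spam do not raise errors.
        if needle_bytes in path.read_bytes():
            matching += 1
    return matching, total

if __name__ == "__main__":
    import sys
    # Usage: python count.py DIRECTORY STRING
    hits, total = count_messages_containing(sys.argv[1], sys.argv[2])
    print(f"{hits} of {total} messages contain {sys.argv[2]!r}")
```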