html_tokenizer
David Relson
relson at osagesoftware.com
Fri Feb 21 00:58:43 CET 2003
At 06:27 PM 2/20/03, Nick Simicich wrote:
>As a side issue in html_tokenizer, it might be reasonable to deal with
>quoted strings inside of tokens. Specifically, if someone codes something as:
>
><a href="http://foo.bar.com/whatever.html"> how should that be tokenized?
>
>Should the entire quoted string be one token?
The subject of what to do with tokens inside of html tags is, at best, very
unclear. Bogofilter needs tokens that can be matched between
messages. Are symbolic URLs matchable? Some parts may be and some may
not. For example, "http://foo.bar.com/cgi-bin/whatever?asdf&qwerty&uiop"
is legit, even though the last three tokens may be totally bogus.
It's worth an experiment to modify the parser to keep the innards of _real_
tags; maybe even use an "href:" prefix (or other suitable
string(s)). Given the modified parser, we can run experiments to see if
the mods help bogofilter, or not.
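
As a rough illustration of the "href:" idea, here is a minimal sketch in Python. It is not bogofilter's lexer (which is written in C); the function name href_tokens, the regular expressions, and the choice of separator characters are all assumptions made up for this example. It pulls the quoted href value out of an <a> tag, splits it into pieces, and prefixes each piece so it cannot collide with the same word seen in body text.

```python
import re

# Hypothetical patterns for illustration only; not taken from bogofilter.
# Matches a double-quoted href value inside an <a ...> tag.
TAG_RE = re.compile(r'<a\s[^>]*?href\s*=\s*"([^"]*)"', re.IGNORECASE)
# Anything that is not a letter, digit, dot, or hyphen separates tokens.
SPLIT_RE = re.compile(r'[^A-Za-z0-9.-]+')


def href_tokens(html, prefix="href:"):
    """Return prefixed tokens for every quoted href value found in html."""
    tokens = []
    for url in TAG_RE.findall(html):
        for piece in SPLIT_RE.split(url):
            if piece:  # skip empty strings produced at the edges of a split
                tokens.append(prefix + piece)
    return tokens
```

With a splitter like this, the host name ("href:foo.bar.com") becomes a token that could match between messages, while the trailing query pieces ("href:asdf" and so on) are scored separately and may simply turn out to be noise. Whether that helps is exactly what the experiment would have to show.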
>Should a > inside the quoted string be respected?
>
><a href="http://foo.bar.com/whatever >.html" > Current code ends the
>token before the first > and the token at the first >, but
>perhaps it should be ended at the " and the token at the second >?
>
>It would be simple to end the quoted string by either (1) looking for
>pairs of quotes in preference to other things when you are in the state
>that indicates that you are in a token that is not a comment, or (2)
>starting another state that is used when you have a token that starts with
>a " when you are inside an html tag.
>
>There are two reasons for doing this "right". One is that tagging tokens
>as from within html tags may be important to telling spam from
>non-spam. HTML with table tags might be spam, while a question about how
><td> is used, in a plain text section, might not be spam.
Code away so we can do some tests and see what is useful and what is
not. For sure, include a way to enable/disable features so their
usefulness can be tested.
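
To make the quoted-string question concrete, here is a minimal sketch of the two behaviours behind a switch, so that either can be tested. It is written in Python for brevity and is not bogofilter code; the name find_tag_end and the respect_quotes flag are hypothetical, and it only handles double quotes.

```python
def find_tag_end(text, start, respect_quotes=True):
    """Return the index of the '>' that closes the tag opening at text[start].

    respect_quotes=False: the first '>' ends the tag, as described for the
    current code. respect_quotes=True: a '>' inside a double-quoted
    attribute value is skipped. Returns -1 if no closing '>' is found.
    """
    in_quote = False
    for i in range(start + 1, len(text)):
        ch = text[i]
        if respect_quotes and ch == '"':
            in_quote = not in_quote  # toggle on every double quote
        elif ch == '>' and not in_quote:
            return i
    return -1
```

For the example tag above, the flag decides whether the tag ends at the > inside the quotes or at the one after the closing ". One thing to watch in testing: with quotes respected, an unbalanced " makes the scan run to the end of the input, so a real implementation would probably want some limit or fallback.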
>the other might be that this might be a simple way for spammers to make
>things seem less spammy.
>Does anyone have any spam-in-the-wild cases of people protecting their
>spam words using any technique like this?
Perhaps not relevant, but recently I've noticed a lot of garbage strings inside
of spam. It often looks like character sequences straight from the
keyboard, e.g. "qwertyuiop" or "asdf". Remembering that I had noticed
"asdf" in one spam, I ran "grep -c asdf spam.Feb.2003/*" and found that 85
of the 3563 spam messages I've received this month contain that particular string.
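
For anyone who wants to repeat that kind of count, note that grep -c given several files prints one matching-line count per file, so the per-message tally still has to be derived from its output. A minimal sketch that computes the tally directly, assuming one message per file in a directory (the function name count_messages_containing is made up for this example):

```python
from pathlib import Path


def count_messages_containing(directory, needle):
    """Return (matching, total) over the regular files in directory.

    Assumes each file holds exactly one message. Files are read as bytes
    so that odd encodings in spam do not raise decoding errors.
    """
    needle_bytes = needle.encode("ascii")
    matching = total = 0
    for path in Path(directory).iterdir():
        if not path.is_file():
            continue  # ignore subdirectories
        total += 1
        if needle_bytes in path.read_bytes():
            matching += 1
    return matching, total
```

Calling count_messages_containing("spam.Feb.2003", "asdf") would return the pair of numbers quoted above, provided the directory is laid out one message per file.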