HTML treatment [was: how many tokens?]

Wed Feb 26 19:17:24 CET 2003

At 12:59 PM 2/26/03, Chris Wilkes wrote:

>On Wed, Feb 26, 2003 at 12:43:45PM -0500, David Relson wrote:
> >
> > No Due Dates<FONT COLOR="fef0d0">zzzzzz</FONT>No Hidden Charges<FONT
> > COLOR="#fef0d0">zzzzzz</FONT>No
> >
> > line 3 - Should "Dates<FONT COLOR="fef0d0">zzzzzz</FONT>No" produce two
> > tokens ("dates", "zzzzzz") or just one, i.e. "dateszzzzzzno" ?
>
>Do you want to keep the FONT tags around?  A lot of spam HTML email has
>crazy fonts all over the place and I think a count of them would help
>identify spam.
>
>Course I'm of the mind that any HTML email I get is highly suspect from
>the get-go.  Maybe I should make a pre-filter for my script to run BF so
>I can have separate text and html email file databases and cutoff rules.
>Anyone doing that?
>
>Chris

Chris,

That brings up an interesting, related question.  What should bogofilter do 
with tokens inside of html tags?  Off the top of my head, there are the 
following choices:

1 - discard all of them
2 - process all of them
3 - keep valid tags and discard invalid tags
4 - keep/discard colors
5 - keep/discard hrefs
5a - if keeping href, keep/discard cgi parameters

It seems that all html tags can be abused by including random character 
sequences.  Some of the listed choices are given with the thought of 
keeping the "good" stuff and discarding the random stuff.
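
To make choices 4, 5 and 5a concrete, here is a rough Python sketch.  It is 
not bogofilter code (bogofilter's lexer is not written this way); the class 
name TagTokenizer, the function tag_tokens, the policy flags and the 
"color:"/"href:" token prefixes are all made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class TagTokenizer(HTMLParser):
    """Collect tokens from inside tags according to a keep/discard policy.

    Hypothetical illustration only; the flags mirror choices 4, 5 and 5a.
    """

    def __init__(self, keep_colors=True, keep_hrefs=True, keep_cgi=False):
        super().__init__()
        self.keep_colors = keep_colors
        self.keep_hrefs = keep_hrefs
        self.keep_cgi = keep_cgi
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands over attribute names already lowercased.
        for name, value in attrs:
            if value is None:
                continue
            if name == "color" and self.keep_colors:
                # Treat "#fef0d0" and "fef0d0" as the same token.
                self.tokens.append("color:" + value.lower().lstrip("#"))
            elif name == "href" and self.keep_hrefs:
                url = urlsplit(value)
                kept = url.netloc + url.path
                if self.keep_cgi and url.query:
                    kept += "?" + url.query
                self.tokens.append("href:" + kept.lower())


def tag_tokens(html_text, **policy):
    """Return the tokens found inside the tags of html_text."""
    parser = TagTokenizer(**policy)
    parser.feed(html_text)
    parser.close()
    return parser.tokens
```

With the default policy, a FONT tag with COLOR="#fef0d0" followed by a link 
to http://example.com/buy.cgi?id=123 yields "color:fef0d0" and 
"href:example.com/buy.cgi".  The random-looking cgi parameters only show up 
when keep_cgi=True is passed.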

At the current time, bogofilter discards everything inside html tags.  It's 
a trivial change to tokenize that text instead.  The other options are more 
difficult.
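
The difference between discarding and tokenizing the tag contents, and the 
earlier question of whether a tag splits the surrounding text, can be shown 
with a small Python sketch.  The regular expressions and function names are 
assumptions for illustration and are far cruder than bogofilter's real 
lexer.

```python
import re

# Very rough patterns, assumed for this sketch only.
TAG = re.compile(r"<[^>]*>")
WORD = re.compile(r"[A-Za-z0-9]+")


def tokens_discarding_tags(text, tag_separates=True):
    """Drop everything inside <...>.

    If tag_separates is true, a tag acts as a token delimiter;
    otherwise the text on both sides of the tag is joined.
    """
    replacement = " " if tag_separates else ""
    stripped = TAG.sub(replacement, text)
    return [word.lower() for word in WORD.findall(stripped)]


def tokens_including_tags(text):
    """Tokenize the tag contents along with the visible text."""
    return [word.lower() for word in WORD.findall(text)]


sample = 'Dates<FONT COLOR="fef0d0">zzzzzz</FONT>No'
print(tokens_discarding_tags(sample))                       # tags delimit tokens
print(tokens_discarding_tags(sample, tag_separates=False))  # tags vanish
print(tokens_including_tags(sample))                        # tag contents kept
```

For the sample line, the first call gives "dates", "zzzzzz" and "no", the 
second gives the single token "dateszzzzzzno", and the third adds "font", 
"color" and "fef0d0" to the visible words.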

Also, should bogofilter convert numeric character references like &#123; to 
their characters?
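
For what the conversion would look like, here is a minimal Python sketch 
using the standard library's html.unescape.  The wrapper name 
decode_character_references is hypothetical, and this says nothing about 
how bogofilter itself would implement it.

```python
import html


def decode_character_references(text):
    """Replace numeric and named character references with their characters."""
    return html.unescape(text)


# A word hidden behind numeric references becomes an ordinary token again.
print(decode_character_references("&#118;&#105;&#97;gra"))
```

Decoding before tokenizing means an obfuscated word and its plain spelling 
end up as the same token, which is the main argument for doing the 
conversion.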

David