HTML treatment [was: how many tokens?]

Wed Feb 26 19:48:32 CET 2003

On Wed, Feb 26, 2003 at 01:17:24PM -0500, David Relson wrote:
> 
> That brings up an interesting, related question.  What should bogofilter do 
> with tokens inside of html tags?  Off the top of my head, there are the 
> following choices:
> 
> 1 - discard all of them
> 2 - process all of them
> 3 - keep valid tags and discard invalid tags
> 4 - keep/discard colors
> 5 - keep/discard hrefs.
> 5a - if keeping href, keep/discard cgi parameters

When I run this email
	http://ladro.com/bf/20030226-01.txt
through bogolexer version 0.10.1.2, it looks like the img src URL below
is not being parsed:
	http://www.homebusinesszone.net/printer/clipart/specs.gif
What's odd is that when I cut out just that part of the email, available here
	http://ladro.com/bf/20030226-02.txt
bogolexer correctly extracts the "printer" and "clipart" tokens.

> It seems that all html tags can be abused by including random character 
> sequences.  Some of the listed choices are given with the thought of 
> keeping the "good" stuff and discarding the random stuff.

Yep.  I've seen some spam come through with totally random characters in
it, probably there to throw off BF-like spam filters.

> At the current time, bogofilter discards the innards.  It's a trivial 
> change to tokenize them.  The other options are more difficult.

It looks like bogolexer keeps some of the innards, but not all the time.
Maybe it's me misusing the tool.

> Also, should bogofilter convert items like &123; to their characters?

Along those lines, would it be helpful to convert any IP-only URLs into
some magic token?  A lot of spam URLs contain the IP address of the
server rather than a domain name.
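
To make the idea concrete, here is a rough sketch in Python (not
bogofilter's C, and not taken from its lexer).  The token name
"url:ip-only" and the function host_token() are made up for
illustration; a real implementation would have to fit into the
existing lexer rules.

```python
import ipaddress
from urllib.parse import urlsplit

# Hypothetical magic token; bogofilter defines no such token.
MAGIC_TOKEN = "url:ip-only"


def host_token(url):
    """Return MAGIC_TOKEN when the URL's host is a literal IP address,
    the host name otherwise, or None when no host can be found."""
    host = urlsplit(url).hostname
    if host is None:
        return None
    try:
        # Accepts dotted-quad IPv4 and (unbracketed) IPv6 literals.
        ipaddress.ip_address(host)
    except ValueError:
        return host  # an ordinary domain name, keep it as a token
    return MAGIC_TOKEN
```

With this, http://192.0.2.10/offer.html would yield the magic token,
while the homebusinesszone.net URL above would yield its host name
unchanged.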

Chris