obscured URL not being tokenized

Sun Dec 21 20:39:12 CET 2003

On Sat, 2003-12-20 at 23:07, Dan Singletary wrote:
> Now, I understand that we want to keep bogofilter as simple as 
> possible.... but it should also be taken into consideration whether we 
> want bogofilter to be comparing apples and oranges or not.

Obviously we want to ignore comments and other invisible stuff, but
not at the expense of turning bogofilter from a Bayesian filter into a
rules-based filter.  The surprising and wonderful characteristic of a
Bayesian filter is that it finds significance in things you would just
throw away, and discovers that what you thought was significant hardly
even registers.  If "display:none" is used extensively in spams, and
almost never in hams, then that'll tip the scale way into the spam
direction irrespective of the 50 slightly hammish terms following it. 
It might slip past the filter for the first dozen times, but as you
train on error, bogofilter quickly learns.
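
A minimal sketch of why one strongly spammy token can outweigh many
slightly hammish ones. This is not bogofilter's actual algorithm: it is
plain naive Bayes with add-one smoothing, plus a hypothetical min_dev
cutoff that ignores tokens whose probability is close to neutral (0.5).
All function names, counts, and the 0.1 cutoff are assumptions made up
for illustration.

```python
import math
from collections import Counter

def token_spam_prob(token, spam_counts, ham_counts, n_spam, n_ham):
    """Estimate P(spam | token) from per-class message counts (add-one smoothing)."""
    p_tok_spam = (spam_counts[token] + 1) / (n_spam + 2)
    p_tok_ham = (ham_counts[token] + 1) / (n_ham + 2)
    return p_tok_spam / (p_tok_spam + p_tok_ham)

def message_score(tokens, spam_counts, ham_counts, n_spam, n_ham, min_dev=0.1):
    """Combine token probabilities in log-odds space, skipping near-neutral tokens."""
    log_odds = 0.0
    for tok in set(tokens):
        p = token_spam_prob(tok, spam_counts, ham_counts, n_spam, n_ham)
        if abs(p - 0.5) < min_dev:
            continue  # token barely leans either way: ignore it (illustrative cutoff)
        log_odds += math.log(p) - math.log(1 - p)
    return 1 / (1 + math.exp(-log_odds))  # 0.5 means "no evidence either way"

# Made-up corpus: "display:none" in 90 of 100 spams and 1 of 100 hams,
# plus 50 words that each lean slightly toward ham (45 spams vs. 55 hams).
words = [f"word{i}" for i in range(50)]
spam_counts = Counter({"display:none": 90, **{w: 45 for w in words}})
ham_counts = Counter({"display:none": 1, **{w: 55 for w in words}})
score = message_score(["display:none"] + words, spam_counts, ham_counts, 100, 100)
```

With the cutoff, the 50 mildly hammish words are ignored and the score
lands well on the spam side. Setting min_dev=0.0 lets them add up and
drag the same message toward ham, so the claim above depends on the
filter discounting near-neutral tokens in some way.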

If you really wanted full html rendering, my suggestion would be to use
a pre-filter which does that and then outputs an html comment "<!-- -->"
around the offending text.  This way, bogofilter could contain only very
simplistic html removal and remain very slim and fast, concentrating on
its Bayesian functionality.  The html rendering could thus be turned on
or off modularly.
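
A rough sketch of such a pre-filter, assuming the only hiding trick it
handles is an inline style containing "display:none". The regex and the
function name comment_out_hidden are hypothetical, and a regex is a
crude stand-in for real rendering: it will miss nested tags of the same
name, external stylesheets, and scripts. The opening tag is left in
place, so "display:none" itself remains visible as a token.

```python
import re

# Matches <tag ... style="...display:none...">inner</tag> (inline styles only).
# Groups: 1 = opening tag, 2 = tag name, 3 = inner text, 4 = closing tag.
HIDDEN = re.compile(
    r"""(<(\w+)\b[^>]*\bstyle\s*=\s*["'][^"']*display\s*:\s*none[^"']*["'][^>]*>)(.*?)(</\2\s*>)""",
    re.IGNORECASE | re.DOTALL,
)

def comment_out_hidden(html):
    """Wrap the text inside hidden elements in an HTML comment."""
    return HIDDEN.sub(
        lambda m: f"{m.group(1)}<!--{m.group(3)}-->{m.group(4)}", html
    )

sample = '<p>Buy now</p><span style="display:none">meeting agenda</span>'
filtered = comment_out_hidden(sample)
```

The filtered text would then be piped into bogofilter, whose simple
comment removal drops the hidden words.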

> rendering would be much simplified from a graphical renderer, but would 
> basically render the HTML just as any other browser would.  This would 
> definitely bloat the code, but seeing how a lot of my spam IS in HTML, 
> it might be beneficial.

For the html module, I'd suggest pulling some code from the Mozilla
project.  You can't do it half-assed or you'll miss many of the
spammers' sneaky tricks.  It needs to be a fully functional renderer,
including css and javascript.  If you can use a Gecko API, then upgrades
to future versions of html, etc., will be simple.  Then on top of that,
you need to add in image recognition to determine what color of a
background image is behind which color of text.  Then you'll need
character recognition to parse out the text contained in an image.  This
won't be a small module, and it won't necessarily be very effective.

> In the interim, I think that David's ideas for giving bogofilter some 
> ability to recognize the color tags are definitely a step in the right 
> direction.

All a spammer needs to do is encode the spam message in a "noisy" image
above the fold, and then put some hammish terms way at the bottom,
hidden by position rather than trickery of color.  Then you'd be back to
relying on Bayes, which would probably pull out "width='500'" and
"height='400'" from the IMG tag as very spammy.

My point is that with rules-based filters, you're in an arms race
against spammers.  Tricks and counter-tricks.  With a Bayesian filter,
this isn't the case... just train the filter on errors and it quickly
learns the new tricks and flags them appropriately without modifying a
single line of code.
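
Train-on-error can be sketched as a loop that updates the filter only
when it misclassifies a message. TinyFilter below is a hypothetical toy
classifier written for this illustration, not bogofilter's code; it
scores only tokens it has already seen in training.

```python
import math
from collections import Counter

class TinyFilter:
    """Toy naive Bayes filter (illustrative only)."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def train(self, tokens, label):
        self.counts[label].update(set(tokens))
        self.messages[label] += 1

    def is_spam(self, tokens):
        log_odds = 0.0
        for tok in set(tokens):
            if tok not in self.counts["spam"] and tok not in self.counts["ham"]:
                continue  # never trained on this token: no evidence
            s = (self.counts["spam"][tok] + 1) / (self.messages["spam"] + 2)
            h = (self.counts["ham"][tok] + 1) / (self.messages["ham"] + 2)
            log_odds += math.log(s / h)
        return log_odds > 0

def train_on_error(filt, stream):
    """Train only on misclassified messages; return the number of errors."""
    errors = 0
    for tokens, label in stream:
        predicted = "spam" if filt.is_spam(tokens) else "ham"
        if predicted != label:
            filt.train(tokens, label)
            errors += 1
    return errors

filt = TinyFilter()
stream = [
    (["display:none", "hello"], "spam"),   # new trick: slips past the first time
    (["display:none", "agenda"], "spam"),  # same trick again
    (["agenda", "lunch"], "ham"),
]
errors = train_on_error(filt, stream)
```

In this made-up stream the first message is the only error: once it has
been trained on, the same trick is flagged without any code change.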

Tom