What to do with this kind of Spam?

Tue Jul 15 02:33:57 CEST 2003

At 08:20 PM 7/14/03, John McCain wrote:
>On Monday 14 July 2003 05:33 pm, michael at optusnet.com.au wrote:
>
> > Sorry, I don't believe that at all.
>
>Believe what?  That we need to adapt or that I'm having problems with
>Bogofilter?
>
> > The problem is that Bogofilter
> > currently discards the information that's needed to deal with this
> > type of attack. Even just scoring token pairs rather than single tokens
> > is probably enough to defeat this style of attack.
> >
>
>How so?  If the tokens are garbage tokens, and thus unique, the likelihood of
>seeing two back to back in a message is astronomically smaller than seeing a
>repeat of a single.

John,

Garbage tokens are relatively unimportant.  Bogofilter scores messages 
based on tokens it recognizes, i.e. tokens with which it has been 
trained.  New, unknown tokens don't affect the score (because min_dev 
causes them to be ignored).

Inclusion of lots of "harmless" words has a similar effect.  Because the 
words tend to have neutral scores (near 0.500) they, too, are ignored.
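
As a rough illustration of the two paragraphs above, here is a minimal Python sketch of a min_dev-style filter.  It is not bogofilter's actual code.  The function name, the 0.1 threshold, the 0.5 score assigned to unknown tokens, and the sample wordlist are all assumptions made up for the example; your configured values may differ.

```python
# Sketch only: hypothetical names and values, not bogofilter's implementation.
MIN_DEV = 0.1          # assumed threshold for "far enough from neutral"
UNKNOWN_SCORE = 0.5    # assumed neutral score for tokens not in the wordlist


def significant_tokens(tokens, wordlist, min_dev=MIN_DEV):
    """Return (token, score) pairs whose score is at least min_dev away from 0.5."""
    kept = []
    for tok in tokens:
        # Tokens never seen in training fall back to the neutral score.
        score = wordlist.get(tok, UNKNOWN_SCORE)
        if abs(score - 0.5) >= min_dev:
            kept.append((tok, score))
    return kept


# Made-up per-token scores (0 = hammish, 1 = spammish).
wordlist = {"viagra": 0.99, "meeting": 0.05, "the": 0.50, "today": 0.48}
message = ["viagra", "xqzjw", "kpfrt", "the", "today", "meeting"]

# Only "viagra" and "meeting" survive; the garbage tokens and the
# near-neutral words are dropped before any scoring happens.
print(significant_tokens(message, wordlist))
```

In this sketch the garbage tokens ("xqzjw", "kpfrt") and the neutral words ("the", "today") never reach the scoring step, which is why padding a message with them changes nothing.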

For "harmless" words to have a significant effect on the score, they need 
to be "hammish" words (that have a low score, i.e. near 0).  There's no way 
for a spammer to know what words are hammish because that's a 
characteristic of the wordlists at _your_ site.
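
To make the point concrete, here is a small Python sketch comparing neutral padding with hammish padding.  Note the assumptions: it combines token scores with a simple Graham-style product formula, which is a stand-in chosen for brevity and not the calculation bogofilter actually performs, and the wordlist, the 0.1 threshold, and the function names are hypothetical.

```python
import math

# Sketch only: a simplified stand-in for scoring, with made-up data.


def combine(scores):
    """Graham-style combination: prod(p) / (prod(p) + prod(1 - p))."""
    spam = math.prod(scores)
    ham = math.prod(1.0 - p for p in scores)
    return spam / (spam + ham)


def message_score(tokens, wordlist, min_dev=0.1):
    """Score a message using only known tokens far enough from 0.5."""
    scores = [
        wordlist[t]
        for t in tokens
        if t in wordlist and abs(wordlist[t] - 0.5) >= min_dev
    ]
    # With no significant tokens there is no evidence either way.
    return combine(scores) if scores else 0.5


# Hypothetical site wordlist: "bogofilter" and "patch" are hammish here
# only because this imaginary site's ham happens to contain them.
wordlist = {
    "viagra": 0.99, "cheap": 0.90,
    "the": 0.50, "today": 0.48,
    "bogofilter": 0.02, "patch": 0.05,
}

spam = ["viagra", "cheap"]
neutral_padding = ["the", "today", "xqzjw"]
hammish_padding = ["bogofilter", "patch"]

print(message_score(spam, wordlist))
print(message_score(spam + neutral_padding, wordlist))   # same as the line above
print(message_score(spam + hammish_padding, wordlist))   # noticeably lower
```

Neutral and unknown padding leaves the score untouched, while the hammish padding pulls it down.  The spammer would need to guess which words are hammish in this particular wordlist to get that effect.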

If you have a message with lots of garbage tokens in it and if you use it 
to train bogofilter, then your wordlists will contain those tokens, i.e. 
will be somewhat bigger than is ideal.  Since disk space is cheap, I tend 
not to worry much about using it.

David