How to deal with extremely high spam levels

Tom Allison tallison at tacocat.net
Tue Jun 22 23:10:25 CEST 2004


Bob Vincent wrote:
> Bogofilter is apparently designed for the situation where the number
> of spams per day roughly equals the number of non-spams per day.
> 
> In my situation, the ratio exceeds 100:1.  In the two weeks I've been
> re-training bogofilter, I've collected:
> 

It will take some time.  But if you think about how bogofilter operates,
this should not be too much of a problem.  bogofilter, especially in
tri-state mode, has to work out two scores: how spammy is the message,
and how hammy is it?
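
For anyone who has not looked at tri-state mode closely, here is a rough
Python sketch of the final decision step as I understand it: one spamicity
score compared against two cutoffs.  The names ham_cutoff and spam_cutoff
match bogofilter's config options, but the default values and the exact
handling of the boundaries here are my assumptions, not something taken
from the bogofilter source.

```python
def classify(score, ham_cutoff=0.45, spam_cutoff=0.99):
    """Tri-state decision: 'ham', 'spam', or 'unsure'.

    The cutoff defaults are illustrative assumptions only; use whatever
    your own bogofilter configuration says.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    if score >= spam_cutoff:
        return "spam"
    if score <= ham_cutoff:
        return "ham"
    # Anything between the two cutoffs is left for a human to look at.
    return "unsure"
```

Everything that lands between the two cutoffs is what ends up in your
unsure folder.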

You are working up a great sample for the spammy question but not much 
for the hammy side of the equation.  I expect that you will find it 
becoming very good at detecting spam to the point where you will start 
finding most of your unsures are actually ham.

When I started running bogofilter, it was kind of "dumb" until I had 
about 100 emails in each category.  At that point it started to show 
some consistency and intelligent guessing.  By the time I got to 500
emails in each, it was pretty much a shoo-in.

The other thing you can do to improve your performance, even without
bogotune, is to check what kind of scores your unsure messages are
getting and move the cutoffs toward those scores.
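
To make that concrete, here is a small Python sketch of the idea, not
anything bogofilter ships.  It assumes you have gone through your unsure
folder by hand and written down (score, is_spam) pairs; the function name
suggest_cutoffs and the input format are made up for illustration.  It
only moves a cutoff as far as the hand-checked scores allow without
putting a known spam on the ham side or a known ham on the spam side.

```python
def suggest_cutoffs(reviewed, ham_cutoff, spam_cutoff):
    """Suggest new (ham_cutoff, spam_cutoff) from hand-checked unsures.

    reviewed: list of (score, is_spam) pairs from the unsure folder.
    Assumes ham means score <= ham_cutoff and spam means
    score >= spam_cutoff.
    """
    ham = [score for score, is_spam in reviewed if not is_spam]
    spam = [score for score, is_spam in reviewed if is_spam]

    # Fall back to the current cutoffs when one class is missing.
    lowest_spam = min(spam, default=spam_cutoff)
    highest_ham = max(ham, default=ham_cutoff)

    # Only trust ham scores below every reviewed spam, and
    # spam scores above every reviewed ham.
    safe_ham = [score for score in ham if score < lowest_spam]
    safe_spam = [score for score in spam if score > highest_ham]

    new_ham_cutoff = max(safe_ham, default=ham_cutoff)
    new_spam_cutoff = min(safe_spam, default=spam_cutoff)
    return new_ham_cutoff, new_spam_cutoff
```

With a handful of reviewed messages this is only a hint.  I would move
the cutoffs a little at a time and keep watching the unsure folder.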

One thing that I did for fun was to actually create a filter of eleven 
mailboxes, each filtering on bogofilter scores of 0.0, 0.1, 0.2...1.0. 
Very quickly you can visualize about where the cutoffs could be to 
simplify things.  Additionally, you start to gain confidence in your 
filtering so you can just discard everything in the 1.0 category and 
then 0.9 and so on...  For me, this turned into ham (0.0, 0.1) and just 
about everything else was spam in groups of 1.0, 0.9, 0.8, 0.5.  I left 
the 0.5 as unsure because I have some weird relatives who send me 
whacked out stuff sometimes.
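
If you want to try the eleven-mailbox trick without writing a pile of
filter rules first, something like this Python sketch will do the sorting
on saved headers.  It assumes the X-Bogosity header carries a
"spamicity=0.xxxxxx" field, which is what I see here, so check what your
version emits.  The helper names are made up, and the bucket is taken by
truncating to the first decimal digit rather than rounding.

```python
import re
from collections import defaultdict

# Assumed header shape, e.g.
#   X-Bogosity: Spam, tests=bogofilter, spamicity=0.999999, version=...
SPAMICITY = re.compile(r"spamicity=(\d\.\d+)")


def bucket_for(header):
    """Return the mailbox name ('0.0' .. '1.0') for one X-Bogosity header."""
    match = SPAMICITY.search(header)
    if match is None:
        return None  # no score found, leave the message alone
    # Keep only the first decimal digit: truncation, not rounding.
    return match.group(1)[:3]


def sort_into_mailboxes(headers):
    """Group headers by score bucket; returns {bucket: [headers]}."""
    mailboxes = defaultdict(list)
    for header in headers:
        bucket = bucket_for(header)
        if bucket is not None:
            mailboxes[bucket].append(header)
    return dict(mailboxes)
```

Counting how many messages land in each bucket gives you the same
picture as the mailboxes did for me.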



