How to deal with extremely high spam levels
Tom Allison
tallison at tacocat.net
Tue Jun 22 23:10:25 CEST 2004
Bob Vincent wrote:
> Bogofilter is apparently designed for the situation where the number
> of spams per day roughly equals the number of non-spams per day.
>
> In my situation, the ratio exceeds 100:1. In the two weeks I've been
> re-training bogofilter, I've collected:
>
It will take some time. But if you think about how bogofilter operates,
this should not be too much of a problem. bogofilter, especially in
tri-state mode, has to answer two questions about each message: how
spammy is it, and how hammy is it?
You are working up a great sample for the spammy question but not much
for the hammy side of the equation. I expect that you will find it
becoming very good at detecting spam to the point where you will start
finding most of your unsures are actually ham.
When I started running bogofilter, it was kind of "dumb" until I had
about 100 emails in each category. At that point it started to show
some consistency and intelligent guessing. By the time I got to 500
emails in each, it was pretty much a shoo-in.
The other thing you can do to improve your performance, even without
bogotune, is to check what kind of scores your unsures are getting and
move the ham and spam cutoffs toward those scores.
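As a rough illustration of that idea, here is a minimal Python sketch
that takes the scores of unsures you have sorted by hand and suggests
cutoffs that would not have misclassified any of them. The function name
and the sample scores are made up for the example, and it is not part of
bogofilter itself; the resulting numbers would still have to be entered
by hand as ham_cutoff and spam_cutoff in your configuration.

```python
def suggest_cutoffs(ham_scores, spam_scores):
    """Suggest (ham_cutoff, spam_cutoff) from hand-sorted message scores.

    Hypothetical helper: the ham cutoff is the highest ham score that is
    still below every spam score, and the spam cutoff is the lowest spam
    score that is still above every ham score. Anything in between stays
    "unsure". Returns None for a cutoff that cannot be placed safely.
    """
    highest_ham = max(ham_scores)
    lowest_spam = min(spam_scores)

    # Ham scores that no spam message undercuts.
    safe_ham = [h for h in ham_scores if h < lowest_spam]
    # Spam scores that no ham message exceeds.
    safe_spam = [s for s in spam_scores if s > highest_ham]

    ham_cutoff = max(safe_ham) if safe_ham else None
    spam_cutoff = min(safe_spam) if safe_spam else None
    return ham_cutoff, spam_cutoff


if __name__ == "__main__":
    # Made-up scores purely for illustration.
    ham = [0.02, 0.10, 0.48]
    spam = [0.55, 0.81, 0.97]
    print(suggest_cutoffs(ham, spam))
```

When ham and spam scores overlap, the sketch leaves the whole overlapping
range as unsure rather than guessing, which seems the safer default while
the ham side of the database is still thin.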
One thing that I did for fun was to actually create a filter of eleven
mailboxes, each filtering on bogofilter scores of 0.0, 0.1, 0.2...1.0.
Very quickly you can visualize about where the cutoffs could be to
simplify things. Additionally, you start to gain confidence in your
filtering so you can just discard everything in the 1.0 category and
then 0.9 and so on... For me, this turned into ham (0.0, 0.1) and just
about everything else was spam in groups of 1.0, 0.9, 0.8, 0.5. I left
the 0.5 as unsure because I have some weird relatives who send me
whacked out stuff sometimes.
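The eleven-mailbox trick can also be approximated offline. The Python
sketch below is only an illustration under some assumptions: it expects
the value of an X-Bogosity header containing a "spamicity=" field (check
what your own bogofilter version actually writes), and it buckets by
truncating the score to one decimal place, which may differ from how the
mail filter rules described above matched. The function names are
invented for the example.

```python
import re
from collections import Counter
from decimal import Decimal, ROUND_DOWN

# Assumed header layout, e.g.
#   "Spam, tests=bogofilter, spamicity=0.999999, version=0.92.8"
SPAMICITY_RE = re.compile(r"spamicity=([01]\.\d+)")


def bucket_for(header_value):
    """Return the bucket label "0.0" .. "1.0" for one header value.

    The score is truncated (not rounded) to one decimal place, so only
    a score of exactly 1.0 lands in the "1.0" bucket. Returns None when
    no spamicity field is found.
    """
    match = SPAMICITY_RE.search(header_value)
    if match is None:
        return None
    score = Decimal(match.group(1))
    return str(score.quantize(Decimal("0.1"), rounding=ROUND_DOWN))


def histogram(header_values):
    """Count how many messages fall into each score bucket."""
    counts = Counter()
    for value in header_values:
        bucket = bucket_for(value)
        if bucket is not None:
            counts[bucket] += 1
    return counts


if __name__ == "__main__":
    # Made-up header values purely for illustration.
    sample = [
        "Spam, tests=bogofilter, spamicity=1.000000, version=0.92.8",
        "Spam, tests=bogofilter, spamicity=0.999999, version=0.92.8",
        "Unsure, tests=bogofilter, spamicity=0.523000, version=0.92.8",
        "Ham, tests=bogofilter, spamicity=0.000000, version=0.92.8",
    ]
    for bucket, count in sorted(histogram(sample).items()):
        print(bucket, count)
```

Printing the counts per bucket gives the same at-a-glance picture as the
mailboxes: wherever the counts thin out between the ham end and the spam
end is a candidate spot for a cutoff.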