How to deal with extremely high spam levels

David Relson relson at osagesoftware.com
Tue Jun 22 23:24:14 CEST 2004


On Tue, 22 Jun 2004 13:32:50 -0400
Bob Vincent wrote:

> Bogofilter is apparently designed for the situation where the number
> of spams per day roughly equals the number of non-spams per day.
> 
> In my situation, the ratio exceeds 100:1.  In the two weeks I've been
> re-training bogofilter, I've collected:
> 
>   Correctly filtered Ham: 101
>   Unsures registered as Ham: 26
>   Correctly filtered Spam: 1067
>   Unsures registered as Spam: 14177
> 
> Part of the reason for my unusually high spam-load is that I'm
> receiving catch-all emails for several domains.  Part of the reason is
> that my email address is listed in several places on the internet.
> 
> I am unwilling to change email addresses or remove my catch-all
> accounts.  I would rather just filter the crap out at the server.
> 
> However, at this rate, it will take nearly a year to collect enough
> non-spam to run bogotune.  I'm not willing to wait that long.
> 
> If Bogofilter is inadequate to this situation, are there any
> recommendations for how to properly deal with it?

Bob,

You've got two sets of unsures, those that should have been scored as
ham and those that are spam.  'Tis good that you have this info.  It
shows that you're paying attention and have an understanding of what's
going on.

Have you looked at the scores of the two sets of messages?  Likely you
can select a lower spam_cutoff value (that will still be higher than any
of your unsure-ham scores).  Using that value should increase the number
of correctly identified spam and reduce the number of unsure-spam.
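
As a rough illustration of that comparison, here is a small Python
sketch.  It is not part of bogofilter.  The function name, the margin
and the sample scores are made up for the example, and it assumes a
message counts as spam when its score is at or above spam_cutoff.

```python
def suggest_spam_cutoff(unsure_ham_scores, unsure_spam_scores, margin=0.01):
    """Suggest a spam_cutoff just above the highest unsure-ham score.

    Returns (cutoff, caught), where caught is how many of the unsure-spam
    scores would be classified as spam with that cutoff.
    The margin is an arbitrary safety gap chosen for this sketch.
    """
    if not unsure_ham_scores:
        raise ValueError("need at least one unsure-ham score")
    # Stay above every ham score seen so far, but never exceed 1.0.
    cutoff = min(1.0, max(unsure_ham_scores) + margin)
    # Assumption: a score at or above the cutoff is treated as spam.
    caught = sum(1 for score in unsure_spam_scores if score >= cutoff)
    return cutoff, caught


# Hypothetical scores, for illustration only.
ham = [0.12, 0.31, 0.47]
spam = [0.45, 0.62, 0.80, 0.93]
cutoff, caught = suggest_spam_cutoff(ham, spam)
print("suggested spam_cutoff: %.2f" % cutoff)
print("unsure-spam now scored as spam: %d of %d" % (caught, len(spam)))
```

Any unsure-spam that scores below your highest unsure-ham stays unsure
with this approach, which is the price of not losing ham.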

Also, are you using the Unsures to train bogofilter so that it can do a
better job in the future?  This is known as "train on error" and should
be an ongoing part of using any Bayesian spam filter.
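
A minimal Python sketch of that training step follows.  It relies on
bogofilter's -s (register as spam) and -n (register as non-spam)
options reading the message from standard input.  The helper names and
file paths are hypothetical, and it assumes each message has not
already been registered in the wordlist.

```python
import subprocess


def train_command(is_spam):
    """Build the bogofilter command line for registering one message."""
    # -s registers the message as spam, -n as non-spam (ham).
    return ["bogofilter", "-s" if is_spam else "-n"]


def train_on_unsure(message_path, is_spam, run=subprocess.run):
    """Feed one manually classified unsure message to bogofilter.

    The run parameter exists only so the call can be swapped out
    (for a dry run, say); by default it is subprocess.run.
    """
    with open(message_path, "rb") as msg:
        # bogofilter reads the message to register from standard input.
        return run(train_command(is_spam), stdin=msg, check=True)


# Example use (paths are hypothetical):
#   train_on_unsure("unsure/msg-0001", is_spam=True)
#   train_on_unsure("unsure/msg-0002", is_spam=False)
```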

If you've just done an initial training, your wordlist may be too small
to fully distinguish ham from spam, and that may be the reason you have
so many unsures.

Hope this helps,

David


