Bogofilter with a specified wordlist.db

Tom Anderson tanderso at oac-design.com
Thu Apr 6 01:17:56 CEST 2006


mouss wrote:
> sure but bayesian filters require training. so
> - their accuracy is poor at start.
> - for users who don't retrain the filter, accuracy may never be 
> satisfactory. (using a "global" wordlist may help, but not if these 
> users receive mail that is different from the one used to train the 
> global db).

Accuracy has generally risen above 95% within 48 hours and a few dozen 
messages when training on error from scratch.  If you start with a corpus 
of previously sorted messages, it gets there much faster.
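
A minimal sketch of what "training on error" means in practice: classify 
each message, and register it only when the filter's verdict disagrees 
with the known label.  It assumes the exit codes documented in the 
bogofilter man page (0 spam, 1 ham, 2 unsure, 3 error) and the -d, -s and 
-n options; the function names and the wordlist directory argument are 
hypothetical.

```python
import subprocess

# Exit codes as documented in bogofilter(1): 0 = spam, 1 = ham, 2 = unsure.
EXIT_TO_LABEL = {0: "spam", 1: "ham", 2: "unsure"}


def training_flag(predicted, actual):
    """Return the registration flag to use, or None if no training is needed.

    A message is trained only when the verdict differs from the true label,
    so "unsure" verdicts are also treated as errors.
    """
    if predicted == actual:
        return None
    return "-s" if actual == "spam" else "-n"


def train_on_error(message_bytes, actual, wordlist_dir):
    """Classify one message and register it only if bogofilter got it wrong."""
    result = subprocess.run(["bogofilter", "-d", wordlist_dir],
                            input=message_bytes)
    predicted = EXIT_TO_LABEL.get(result.returncode)
    if predicted is None:
        raise RuntimeError("bogofilter reported an error while classifying")

    flag = training_flag(predicted, actual)
    if flag is not None:
        trained = subprocess.run(["bogofilter", "-d", wordlist_dir, flag],
                                 input=message_bytes)
        if trained.returncode == 3:
            raise RuntimeError("bogofilter reported an error while training")
    return flag
```

Calling train_on_error(raw_message, "spam", "/path/to/wordlist/dir") would 
then touch the wordlist only for messages the filter missed.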

If you don't want to train the filter, don't use Bogofilter or any other 
statistical filter.  You won't be happy with the result, even if you 
pair it with a procedural filter.

> Chris's idea is to "shoulder" (or boost?) bogo using SA. I would love to 
> see the results of this. (I find this better than using public corpora).
> 
> for example, when you install bogo for the first time, you use SA too. 
> if SA score is "sure" (<0 or >10 for instance), then train bogofilter 
> with this email. There is still a risk of error (FN or FP) of course, 
> but for users who don't retrain bogofilter, this is better than nothing.
> 
> once the user's wordlist is "mature", SA can be skipped for that user.

If you don't want to train, Bogofilter should be skipped altogether... 
just use SpamAssassin exclusively.  The idea of "boosting" Bogofilter 
with SpamAssassin is like hand-holding some grandma through her first 
time creating a Word document and then walking away and letting her 
loose on a Sendmail config.
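
For readers weighing the quoted proposal, the gating rule it describes 
could look roughly like the sketch below.  The thresholds (below 0 for 
ham, above 10 for spam) are the "for instance" values from the quoted 
text.  The X-Spam-Status header layout ("score=" or the older "hits=") 
and both function names are assumptions, not something taken from either 
tool's documentation here.

```python
import re
from email import message_from_bytes

# Assumed header form: "X-Spam-Status: Yes, score=12.3 required=5.0 ..."
# Older SpamAssassin versions are assumed to write "hits=" instead.
SCORE_RE = re.compile(r"\b(?:score|hits)=(-?\d+(?:\.\d+)?)")


def sa_score(message_bytes):
    """Extract the SpamAssassin score from a message, or None if absent."""
    msg = message_from_bytes(message_bytes)
    header = str(msg.get("X-Spam-Status", ""))
    match = SCORE_RE.search(header)
    return float(match.group(1)) if match else None


def sa_training_label(score, ham_below=0.0, spam_above=10.0):
    """Return "ham" or "spam" only when the score is outside the unsure band.

    Scores between the two thresholds (inclusive) return None, meaning the
    message should not be used to train the wordlist automatically.
    """
    if score is None:
        return None
    if score < ham_below:
        return "ham"
    if score > spam_above:
        return "spam"
    return None
```

Anything that returns None here would still need a human decision, which 
is the feedback step the automated approach tries to avoid.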

>>I feel that adding Spam Assassin to the mix would only introduce false 
>>positives, of which I currently receive zero.
> 
> one can reduce this by using a conservative setup (disable or lower the 
> score of rules that generate FPs).

You can't reduce false positives below zero.  Just train the 4 errors 
per week in Bogofilter and be done with it.  SpamAssassin doesn't bring 
anything to the table unless you want complete and total automation and 
don't mind receiving lots of FNs and discarding FPs.  And if that's what 
you want, then Bogofilter isn't for you.  Bogofilter provides 
near-complete automation and unrivalled accuracy, but you have to 
provide feedback every few days to keep it on track.

This can certainly be a little harder to achieve in a multiuser setup, 
but in that case, I would recommend training a global wordlist on the 
input from a honeypot (known spam) and the outgoing mail from your SMTP 
server (known ham).  You could probably even manage to use per-user 
wordlists trained with their own outgoing mail as ham and the honeypot 
as spam.  It may not be quite as good as training on their own actual 
errors, but I'd wager it'd be much better than SpamAssassin, and it 
would be completely automated.
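
A rough sketch of that per-user setup, assuming the honeypot traffic and 
each user's outgoing mail are already collected as mbox files.  It relies 
on bogofilter's -d (wordlist directory), -M (mbox input), -s (register 
spam) and -n (register ham) options; the directory layout, file paths and 
function names are all hypothetical.

```python
import subprocess
from pathlib import Path


def train_command(wordlist_dir, label):
    """Build the bogofilter command that registers an mbox as spam or ham."""
    flag = {"spam": "-s", "ham": "-n"}[label]
    return ["bogofilter", "-d", str(wordlist_dir), "-M", flag]


def train_from_mbox(wordlist_dir, label, mbox_path):
    """Feed one mbox file to bogofilter on stdin for registration."""
    with open(mbox_path, "rb") as mbox:
        result = subprocess.run(train_command(wordlist_dir, label), stdin=mbox)
    # Exit status 3 is documented as "I/O or other errors".
    if result.returncode == 3:
        raise RuntimeError(f"bogofilter failed on {mbox_path}")


def train_user(user, base_dir, honeypot_mbox, outgoing_mbox):
    """Train one user's wordlist: honeypot mail as spam, own sent mail as ham."""
    wordlist_dir = Path(base_dir) / user  # hypothetical per-user layout
    wordlist_dir.mkdir(parents=True, exist_ok=True)
    train_from_mbox(wordlist_dir, "spam", honeypot_mbox)
    train_from_mbox(wordlist_dir, "ham", outgoing_mbox)
```

Pointing every user at the same directory instead of a per-user one would 
give the global-wordlist variant.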

Tom



