Performance of Bogofilter, etc.

Peter Bishop pgb at adelard.com
Fri Jul 4 10:08:23 CEST 2003


On 3 Jul 2003 at 19:14, Forrest Aldrich wrote:

> I've been using SpamAssassin, which I believe has a performance hit with 
> its perl design (though it's certainly suitable for my personal system).

I think bogofilter can process around 100 messages per second.
That is useful on a mail server, but I would imagine that speed is not an
issue on your system.

> Someone recently mentioned that they felt the Bayes implementation of SA 
> was superior.
> 
> I'm curious about input about BogoFilter, compared to others (not just SA),
> and any performance issues/benchmarks.

I don't know if there have been any comparative studies of different 
filters. Unlike SpamAssassin, bogofilter's filtering performance can vary
considerably, because the filter needs to be "trained".

> I presume one could somehow port over the SA database files for use with 
> BogoFilter (I saw something in the FAQ but I believe that has to do with 
> messages, not the database).
 
I am not sure you can port the database, as the two databases are not the
same.

The SpamAssassin database is a set of rules that look for key spam 
indicators. 

The bogofilter database just consists of two lists of words found in good 
messages (ham) and in spam. To train bogofilter you have to feed messages 
into it and tell bogofilter whether the message is spam or ham (so the 
words can be added to the correct list). Filtering performance improves as 
more messages are used.
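
As a rough illustration of the "two lists of words" idea, here is a minimal
Python sketch. It is not bogofilter's actual code or on-disk format; the
names ham_words, spam_words, tokenize and train are made up for this
example, and the tokenizer and the choice to count each word once per
message are simplifying assumptions.

```python
import re
from collections import Counter

# Hypothetical in-memory stand-ins for the two word lists.
ham_words = Counter()
spam_words = Counter()

def tokenize(text):
    # Naive tokenizer (assumption): lowercase runs of letters, digits, $ ' -
    return re.findall(r"[a-z0-9$'-]+", text.lower())

def train(text, is_spam):
    """Add the words of one message to the spam or ham list."""
    target = spam_words if is_spam else ham_words
    # Assumption: each distinct word is counted once per message.
    target.update(set(tokenize(text)))

# Training means feeding in messages together with the correct label.
train("Cheap pills, buy now", is_spam=True)
train("Meeting agenda for Monday", is_spam=False)
train("Buy cheap watches now", is_spam=True)

# How often "cheap" was seen in spam versus ham so far.
print(spam_words["cheap"], ham_words["cheap"])
```

The more labelled messages go through train(), the more words each list
covers, which is why filtering improves with a larger training set.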

The only way SpamAssassin could help is by identifying whether a message
is ham or spam before the message is fed into bogofilter during training.

However, if SpamAssassin gets it wrong, bogofilter performance can suffer.
So if you already know which messages are ham or spam, you don't need
SpamAssassin.
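
If you did want to use SpamAssassin's verdict as a first guess at the
training label, a sketch along these lines might do. It assumes the
messages have already passed through SpamAssassin and carry its
X-Spam-Status header starting with "Yes" or "No"; the function name and
the sample message are invented for illustration. Since a wrong label
hurts training, the output is best treated as a suggestion to be checked
by hand.

```python
from email import message_from_string

def label_from_spamassassin(raw_message):
    """Return 'spam', 'ham', or None if no usable X-Spam-Status header exists."""
    msg = message_from_string(raw_message)
    status = msg.get("X-Spam-Status")
    if status is None:
        return None
    verdict = status.strip().lower()
    if verdict.startswith("yes"):
        return "spam"
    if verdict.startswith("no"):
        return "ham"
    return None

# Made-up sample message with a SpamAssassin-style header.
sample = "X-Spam-Status: Yes, hits=12.3 required=5.0\nSubject: test\n\nbody\n"
print(label_from_spamassassin(sample))
```

Messages for which the function returns None have no verdict at all and
would need to be sorted manually before training.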

What you *do* need is several hundred ham and spam messages to do the
training. You can get sample spam messages from ftp.spamarchive.org.

My own database has been trained with around 2000 spams and 1700 hams.
With this database, bogofilter fails to detect about 1% of spams, and
about 0.1% of hams are marked as spam.
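
To put those rates in perspective, here is a small back-of-envelope
calculation. It assumes, purely for illustration, a mailbox the same size
as my training set; the rates were not necessarily measured on those exact
messages, and the function name is made up.

```python
def expected_errors(n_spam, n_ham, miss_rate, false_alarm_rate):
    """Expected number of missed spams and wrongly flagged hams."""
    missed_spam = n_spam * miss_rate
    flagged_ham = n_ham * false_alarm_rate
    return missed_spam, flagged_ham

# 2000 spams at a 1% miss rate, 1700 hams at a 0.1% false-alarm rate.
missed, flagged = expected_errors(2000, 1700, 0.01, 0.001)
print(f"{missed:.1f} spams missed, {flagged:.1f} hams flagged")
```

That works out to roughly 20 spams slipping through and fewer than 2 good
messages being marked as spam.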

Note to the bogofilter list:
Should the bogofilter package provide a pre-built database that can be used 
as a starting point? (e.g. trained with typical spam and maybe some fairly 
bland ham messages)


-- 
Peter Bishop 
pgb at adelard.com
pgb at csr.city.ac.uk





