Performance of Bogofilter, etc.
Peter Bishop
pgb at adelard.com
Fri Jul 4 10:08:23 CEST 2003
On 3 Jul 2003 at 19:14, Forrest Aldrich wrote:
> I've been using SpamAssassin, which I believe has a performance hit with
> its perl design (though it's certainly suitable for my personal system).
I think bogofilter can process around 100 messages per second.
That is useful on a mail server, but I would imagine that speed is not an
issue on your system.
> Someone recently mentioned that they felt the Bayes implementation of SA
> was superior.
>
> I'm curious about input about BogoFilter, compared to others (not just SA),
> and any performance issues/benchmarks.
I don't know if there have been any comparative studies of different
filters. Unlike SpamAssassin, bogofilter's filtering performance can vary
considerably, because the filter needs to be "trained".
> I presume one could somehow port over the SA database files for use with
> BogoFilter (I saw something in the FAQ but I believe that has to do with
> messages, not the database).
I am not sure you can port the database as they are not the same.
The SpamAssassin database is a set of rules that look for key spam
indicators.
The bogofilter database just consists of two lists of words found in good
messages (ham) and in spam. To train bogofilter you have to feed messages
into it and tell bogofilter whether the message is spam or ham (so the
words can be added to the correct list). Filtering performance improves as
more messages are used.
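
To make the "two lists of words" idea concrete, here is a toy sketch in
Python. It is only an illustration under stated assumptions: the names
(ham_words, spam_words, train, spam_fraction) are hypothetical, the
tokenizer is a simplification, and this is not bogofilter's actual storage
format or scoring algorithm.

```python
import re
from collections import Counter

# Hypothetical in-memory stand-ins for the two word lists.
# bogofilter's real database format and tokenizer are different.
ham_words = Counter()
spam_words = Counter()

def tokenize(text):
    """Split a message into lowercase word tokens (a simplification)."""
    return re.findall(r"[a-z0-9$'-]+", text.lower())

def train(text, is_spam):
    """Count each distinct word of the message in the matching list."""
    target = spam_words if is_spam else ham_words
    target.update(set(tokenize(text)))

def spam_fraction(word):
    """Fraction of trained messages containing `word` that were spam.

    Returns None for a word that has never been seen in training.
    """
    s, h = spam_words[word], ham_words[word]
    total = s + h
    return s / total if total else None

# Tiny made-up training set, for illustration only.
train("Buy cheap pills now", is_spam=True)
train("Cheap offer, buy now", is_spam=True)
train("Meeting notes for the project", is_spam=False)
train("Can we buy lunch after the meeting", is_spam=False)
```

With so few messages a word like "buy" is ambiguous (it appears in both
lists), which is one way to see why filtering improves as more messages
are fed in.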
The only way SpamAssassin helps is that it could be used to identify
whether the message is ham or spam prior to feeding the message into
bogofilter during training.
However if SpamAssassin gets it wrong, bogofilter performance can suffer.
So if you already know which messages are ham or spam, you don't need
SpamAssassin.
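
If the messages are already sorted, the training step could be scripted
roughly as below. This is a sketch under assumptions: it relies on
bogofilter's -s (register as spam) and -n (register as non-spam) options
reading a message on standard input, the one-message-per-file directory
layout is hypothetical, and the helper names are made up for illustration.

```python
import subprocess
from pathlib import Path

def register_command(is_spam):
    """Build the bogofilter command line: -s for spam, -n for ham."""
    return ["bogofilter", "-s" if is_spam else "-n"]

def train_from_dir(directory, is_spam, run=subprocess.run):
    """Feed every file in `directory` to bogofilter as one message each.

    Assumes one message per file (a hypothetical layout). The `run`
    parameter exists only so the call can be swapped out when testing.
    Returns the number of messages registered.
    """
    count = 0
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            run(register_command(is_spam), input=path.read_bytes(), check=True)
            count += 1
    return count
```

For example, train_from_dir("sorted/spam", is_spam=True) followed by
train_from_dir("sorted/ham", is_spam=False), where both directory names
are placeholders.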
What you *do* need is several hundred ham and spam messages to do the
training. You can get sample spam messages from ftp.spamarchive.org.
My own database has been trained with around 2000 spams and 1700 hams.
Performance-wise, with my database, it fails to detect about 1% of spams.
And about 0.1% of hams are marked as spam.
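
For anyone wanting to measure the same two figures on their own mail, the
arithmetic is sketched below. The function name and the counts in the
example call are illustrative assumptions, not my actual test data.

```python
def error_rates(spam_total, spam_missed, ham_total, ham_flagged):
    """Return (missed-spam rate, false-alarm rate) as fractions.

    spam_missed: spam messages the filter let through as ham.
    ham_flagged: ham messages the filter wrongly marked as spam.
    """
    missed_rate = spam_missed / spam_total
    false_alarm_rate = ham_flagged / ham_total
    return missed_rate, false_alarm_rate

# Illustrative counts only: 10 of 1000 spams missed, 1 of 1000 hams flagged.
missed, false_alarm = error_rates(1000, 10, 1000, 1)
print(f"missed spam: {missed:.1%}, ham marked as spam: {false_alarm:.1%}")
```

Counts of that size would correspond to rates of about 1% and 0.1%.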
Note to the bogofilter list:
Should the bogofilter package provide a pre-built database that can be used
as a starting point? (e.g. trained with typical spam and maybe some fairly
bland ham messages)
--
Peter Bishop
pgb at adelard.com
pgb at csr.city.ac.uk