bogofilter producing poor results

Tue Nov 12 02:36:35 CET 2002

Hi all,

After reading Greg Louis's paper comparing the two algorithms (Robinson
and Graham), I did some informal testing on email that I have collected
over the past few months.  Unfortunately, my results do not show a clear
winner, and in fact neither algorithm seems to perform acceptably.

The training set consists of two mailboxes of manually separated spam and
non-spam, from email collected over a period of about six months.  (These
are all emails sent directly to me, as I have procmail filter mailing
list posts separately.)  Because they had already been processed by
spamassassin, I first ran them through spamassassin again with the -d
option, which strips the headers it had added.  Then I trained bogofilter
on them, first using the Robinson algorithm:

$ bogofilter -V

bogofilter version 0.8.0 Copyright (C) 2002 Eric S. Raymond

bogofilter comes with ABSOLUTELY NO WARRANTY. This is free software, and you
are welcome to redistribute it under the General Public License. See the
COPYING file with the source distribution for details.

$ rm goodlist.db
$ rm spamlist.db
$ zcat corpus.spam.gz | formail -s spamassassin -d > f-corpus.spam
$ zcat corpus.ham.gz | formail -s spamassassin -d > f-corpus.ham
$ bogofilter -r -s -v < f-corpus.spam
# 626200 words, 1181 messages
$ bogofilter -r -n -v < f-corpus.ham
# 286022 words, 855 messages

Then I took another set of manually separated spam and non-spam, from
email collected in the month or so directly following the email used in
the training set.  Again, I filtered these back through spamassassin,
and then ran them through bogofilter:

$ cat runbf
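#!/bin/sh
# Classify the message on stdin, passing any options (-r or -g) through.
# An exit code of 0 is reported as spam; any other exit code as non-spam.
if bogofilter "$@"; then
  echo SPAM
else
  echo NOTSPAM
fi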

$ (<f-newspam formail -c -s runbf -r) | grep ^NOTSPAM | wc -l
     36
$ (<f-newspam formail -c -s runbf -r) | grep ^SPAM | wc -l
    186
$ (<f-newham formail -c -s runbf -r) | grep ^NOTSPAM | wc -l
    126
$ (<f-newham formail -c -s runbf -r) | grep ^SPAM | wc -l
      2
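
Each count above reruns the whole classification, so every test mailbox is
scored twice.  A single pass that tallies both verdicts would avoid that.
The following is only a sketch: the tally function is a hypothetical helper,
not part of bogofilter or formail, and the demonstration feeds it made-up
verdict lines rather than real mail.

```shell
#!/bin/sh
# tally (hypothetical helper): count SPAM and NOTSPAM verdict lines
# read from stdin in a single pass and print both totals.
tally() {
  awk '$0 == "SPAM"    { s++ }
       $0 == "NOTSPAM" { n++ }
       END { printf "SPAM %d\nNOTSPAM %d\n", s, n }'
}

# Intended use with the runbf wrapper (needs formail and bogofilter):
#   formail -c -s runbf -r < f-newspam | tally

# Demonstration on synthetic verdicts only:
printf 'SPAM\nNOTSPAM\nSPAM\n' | tally
```

With the real mailboxes, the two totals should match the separate
grep | wc -l counts.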

So with the Robinson algorithm I get 36 false negatives and 2 false
positives out of 350 emails in total (222 spam and 128 non-spam).

Then I tried the Graham algorithm:

$ rm goodlist.db
$ rm spamlist.db
$ bogofilter -g -s -v < f-corpus.spam
# 626200 words, 1181 messages
$ bogofilter -g -n -v < f-corpus.ham
# 286022 words, 855 messages

$ (<f-newspam formail -c -s runbf -g) | grep ^NOTSPAM | wc -l
     16
$ (<f-newspam formail -c -s runbf -g) | grep ^SPAM | wc -l
    206
$ (<f-newham formail -c -s runbf -g) | grep ^NOTSPAM | wc -l
    120
$ (<f-newham formail -c -s runbf -g) | grep ^SPAM | wc -l
      8

So with the Graham algorithm I get 16 false negatives and 8 false
positives out of the same 350 emails (222 spam and 128 non-spam).
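
Since the spam and non-spam test sets differ in size, the raw counts are
easier to compare as rates.  A minimal sketch of that arithmetic follows.
The rates function is a hypothetical helper, and the only inputs are the
counts reported above.

```shell
#!/bin/sh
# rates (hypothetical helper): print false-negative and false-positive
# rates as percentages.
# Arguments: label, missed spam, total spam, flagged non-spam, total non-spam.
rates() {
  awk -v label="$1" -v fn="$2" -v spam="$3" -v fp="$4" -v ham="$5" 'BEGIN {
    printf "%s: FN rate %.1f%%, FP rate %.1f%%\n",
           label, 100 * fn / spam, 100 * fp / ham
  }'
}

rates Robinson 36 222 2 128
rates Graham   16 222 8 128
```

On these counts that works out to roughly 16% of spam missed and under 2%
of non-spam flagged for Robinson, against roughly 7% and 6% for Graham.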

What does this mean?  As a user, to me it means that both algorithms
still need tweaking.  The Robinson algorithm seems to be safer,
classifying fewer emails as spam overall, but it also fails to catch a
lot of the spam.  The Graham algorithm seems to be more aggressive at
catching spam, but it also flags more non-spam as spam (8 messages
against Robinson's 2).  So there's no clear winner here.
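
One way to make that trade-off concrete is to charge each false positive
some multiple of a false negative and compare the totals.  This is only a
sketch: the weight is an assumption about how much worse a lost legitimate
email is than a missed spam, and the cost function is a hypothetical helper.

```shell
#!/bin/sh
# cost (hypothetical helper): false negatives + weight * false positives.
# The weight is an assumed penalty for a false positive relative to a
# false negative; it is not something bogofilter itself reports or uses.
cost() {
  awk -v fn="$1" -v fp="$2" -v w="$3" 'BEGIN { print fn + w * fp }'
}

# Compare both algorithms on the test-set counts for a few assumed weights.
for w in 1 3 4 10; do
  echo "weight $w: Robinson $(cost 36 2 "$w"), Graham $(cost 16 8 "$w")"
done
```

On these counts the two algorithms cost the same at a weight of 20/6, or
about 3.3.  Below that weight Graham comes out ahead, and above it Robinson
does.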

I'm convinced that probabilistic spam filtering is the way to go, but
at the moment bogofilter is much less accurate than my old and outdated
copy of spamassassin.

Has anyone else experienced such poor performance from bogofilter?  Is
my training set perhaps too small, despite representing some six months
of email collection?  Or could the type of email that I receive be the
problem?

Thanks for any advice.

--
William Ono <a1bformk at tinny.soundwave.net>
PGP 2048R/93BA6AFD E3 64 C5 43 3E B3 2D A6    C6 D7 E3 45 90 24 78 DE