bogofilter producing poor results

Allison, Thomas Thomas.Allison at ONSTAR.com
Tue Nov 12 15:43:16 CET 2002


Actually ... no.

Currently bogofilter is generally outperforming spamassassin at catching
spam. Admittedly, it does seem to have a little more trouble with false
positives.

I trained bogofilter by just forcing it to follow spamassassin for about a
month and then let it run on its own.
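
For illustration, that kind of "follow spamassassin" training could look
roughly like the sketch below. The function name train_from_spamassassin is
hypothetical and the details are assumptions rather than the exact setup
used; it relies only on spamassassin's -e option (non-zero exit status for
spam) and bogofilter's -s and -n registration options.

```shell
# Hypothetical helper: register one message (read from stdin) with
# bogofilter, using spamassassin's verdict as the training label.
train_from_spamassassin() {
    msg=$(mktemp) || return 1
    cat > "$msg"
    # With -e, spamassassin exits non-zero when it judges the message spam.
    if spamassassin -e < "$msg" > /dev/null; then
        bogofilter -n < "$msg"    # register as non-spam
    else
        bogofilter -s < "$msg"    # register as spam
    fi
    rm -f "$msg"
}
```

It would be called once per incoming message. Note that any spamassassin
failure also yields a non-zero status, so a more careful version would
check for that before registering the message as spam.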

> -----Original Message-----
> From: William Ono [mailto:a1bformk at tinny.soundwave.net]
> Sent: Monday, November 11, 2002 8:37 PM
> To: bogofilter at aotto.com
> Subject: bogofilter producing poor results
> 
> 
> Hi all,
> 
> After reading Greg Louis's paper comparing the two algorithms, I did
> some informal testing on email that I have collected over the past few
> months.  Unfortunately, my results do not show a clear winner, and in
> fact neither algorithm seems to perform acceptably.
> 
> The training set is two mailboxes containing manually separated spam and
> non-spam, from email collected over a period of about six months.  (These
> are all emails sent directly to me, as I have procmail filter mailing
> list posts separately.)  As they had already been processed by
> spamassassin, I first ran them through spamassassin again with the -d
> option, which strips the headers it added previously.  Then I ran them
> through bogofilter, first using the Robinson algorithm:
> 
> $ bogofilter -V
> 
> bogofilter version 0.8.0 Copyright (C) 2002 Eric S. Raymond
> 
> bogofilter comes with ABSOLUTELY NO WARRANTY. This is free software, and you
> are welcome to redistribute it under the General Public License. See the
> COPYING file with the source distribution for details.
> 
> $ rm goodlist.db
> $ rm spamlist.db
> $ zcat corpus.spam.gz | formail -s spamassassin -d > f-corpus.spam
> $ zcat corpus.ham.gz | formail -s spamassassin -d > f-corpus.ham
> $ bogofilter -r -s -v < f-corpus.spam
> # 626200 words, 1181 messages
> $ bogofilter -r -n -v < f-corpus.ham
> # 286022 words, 855 messages
> 
> Then I took another set of manually separated spam and non-spam, from
> email collected in the month or so directly following the email used in
> the training set.  Again, I filtered these back through spamassassin,
> and then ran them through bogofilter:
> 
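> $ cat runbf
> #!/bin/sh
> # Classify one message on stdin; $1 selects the algorithm (-r or -g).
> # bogofilter exits with status 0 when it judges the message to be spam.
> bogofilter "$1"
> EXITCODE=$?
> if [ "$EXITCODE" -eq 0 ]; then
>   echo SPAM
> else
>   echo NOTSPAM
> fi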
> 
> $ (<f-newspam formail -c -s runbf -r) | grep ^NOTSPAM | wc -l
>      36
> $ (<f-newspam formail -c -s runbf -r) | grep ^SPAM | wc -l
>     186
> $ (<f-newham formail -c -s runbf -r) | grep ^NOTSPAM | wc -l
>     126
> $ (<f-newham formail -c -s runbf -r) | grep ^SPAM | wc -l
>       2
> 
> So using the Robinson algorithm I have 36 false negatives and 2 false
> positives out of a total of 350 emails, with 222 of them being spam and
> 128 of them being non-spam.
> 
> Then I tried the Graham algorithm:
> 
> $ rm goodlist.db
> $ rm spamlist.db
> $ bogofilter -g -s -v < f-corpus.spam
> # 626200 words, 1181 messages
> $ bogofilter -g -n -v < f-corpus.ham
> # 286022 words, 855 messages
> 
> $ (<f-newspam formail -c -s runbf -g) | grep ^NOTSPAM | wc -l
>      16
> $ (<f-newspam formail -c -s runbf -g) | grep ^SPAM | wc -l
>     206
> $ (<f-newham formail -c -s runbf -g) | grep ^NOTSPAM | wc -l
>     120
> $ (<f-newham formail -c -s runbf -g) | grep ^SPAM | wc -l
>       8
> 
> So now using the Graham algorithm I have 16 false negatives and 8 false
> positives out of a total of 350 emails, again with 222 of them being
> spam and 128 of them being non-spam.
> 
> What does this mean?  As a user, to me it means that both algorithms still
> need a little tweaking.  The Robinson algorithm seems to be safer,
> categorizing fewer emails in general as spam, but it also fails to
> catch a lot of the spam.  The Graham algorithm seems to be more
> aggressive at catching spam, but it also catches more non-spam (8
> messages versus 2).  So there's no clear winner here.
> 
> I'm convinced that probabilistic spam filtering is the way to go, but
> at the moment bogofilter is much less accurate than my old and outdated
> copy of spamassassin.
> 
> Has anyone else experienced such poor performance from bogofilter?  Is,
> perhaps, my training set too small, despite representing some six months
> of email collection?  Or could it perhaps be that the type of email I
> receive is problematic?
> 
> Thanks for any advice.
> 
> --
> William Ono <a1bformk at tinny.soundwave.net>
> PGP 2048R/93BA6AFD E3 64 C5 43 3E B3 2D A6    C6 D7 E3 45 90 24 78 DE
> 
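To put the counts quoted above on a common scale, the sketch below turns
them into rates. error_rates is a hypothetical helper, not part of
bogofilter; it takes the false negatives, false positives, number of spam
messages and number of non-spam messages, in that order.

```shell
# error_rates FN FP SPAM HAM
# Prints the false-negative rate (as % of spam) and the
# false-positive rate (as % of non-spam).
error_rates() {
    awk -v fn="$1" -v fp="$2" -v spam="$3" -v ham="$4" 'BEGIN {
        printf "FN %.1f%% of spam, FP %.1f%% of non-spam\n",
               100 * fn / spam, 100 * fp / ham
    }'
}

error_rates 36 2 222 128   # Robinson (-r)
error_rates 16 8 222 128   # Graham (-g)
```

By that measure Robinson misses roughly 16% of the spam and flags under 2%
of the non-spam, while Graham misses roughly 7% of the spam and flags about
6% of the non-spam.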



