bogofilter producing poor results

Greg Louis glouis at dynamicro.on.ca
Tue Nov 12 13:27:50 CET 2002


On 20021111 (Mon) at 17:36:35 -0800, William Ono wrote:
> Hi all,
> 
> After reading Greg Louis's paper comparing the two algorithms, I did
> some informal testing on email that I have collected over the past few
> months.  Unfortunately my results show no clear winner; in fact,
> neither algorithm seems to perform acceptably.
> 
> The training set is two mailboxes containing manually separated spam
> and non-spam, from email collected over a period of about six months.
> (These are all emails sent directly to me, as I have procmail filter
> mailing list posts separately.)  Since they had already been processed
> by spamassassin, I first ran them through spamassassin again with the
> -d option, which strips the headers it added previously.  Then I ran
> them through bogofilter, first using the Robinson algorithm:
> 
> $ bogofilter -V
> 
> bogofilter version 0.8.0 Copyright (C) 2002 Eric S. Raymond
> 
> bogofilter comes with ABSOLUTELY NO WARRANTY. This is free software, and you
> are welcome to redistribute it under the General Public License. See the
> COPYING file with the source distribution for details.
> 
> $ rm goodlist.db
> $ rm spamlist.db
> $ zcat corpus.spam.gz | formail -s spamassassin -d > f-corpus.spam
> $ zcat corpus.ham.gz | formail -s spamassassin -d > f-corpus.ham
> $ bogofilter -r -s -v < f-corpus.spam
> # 626200 words, 1181 messages
> $ bogofilter -r -n -v < f-corpus.ham
> # 286022 words, 855 messages

That's a small training set.  Bogofilter, at least in my hands, began
to perform better (around 5% false negatives and <1% false positives)
when my training set grew to about 4300 nonspams and 1800 spams (I had
no spam archive to start with, but I could draw on old nonspams; hence
the lopsidedness).  Now I'm at 6500 and 7200 respectively, and I'm getting
around 2% false negatives and less than 0.5% false positives these
days.
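
Growing the corpus is just a matter of registering newly sorted mail
the same way you built the training set.  The file names below are
hypothetical, but -s and -n are the same registration options you used
above:

$ bogofilter -s < new.spam    # register an mbox of fresh spams
$ bogofilter -n < new.ham     # register an mbox of fresh nonspams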

> Then I took another set of manually separated spam and non-spam, from
> email collected in the month or so directly following the email used in
> the training set.  Again, I filtered these back through spamassassin,
> and then ran them through bogofilter:

> So using the Robinson algorithm I have 36 false negatives and 2 false
> positives out of a total of 350 emails, with 222 of them being spam and
> 128 of them being non-spam.

> So now using the Graham algorithm I have 16 false negatives and 8 false
> positives out of a total of 350 emails, again with 222 of them being
> spam and 128 of them being non-spam.
> 
> What's this mean?  As a user, to me it means that both algorithms still
> need a little tweaking.

You can tune both: for testing, set SPAM_CUTOFF to a level that gives
each algorithm just one false positive, and then compare their false
negatives.  (As they stand, your numbers work out to roughly 16% false
negatives and 1.6% false positives for Robinson, versus roughly 7% and
6% for Graham.)  But again, random variation in the "spamminess" of
nonspams and vice versa suggests that one needs to be chary of drawing
conclusions from so small a test corpus.
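
Counting the errors at a given cutoff is easy to script.  Here's a
rough, untested sketch; it assumes formail to split the mboxes
(f-test.spam and f-test.ham standing in for your test sets) and relies
on bogofilter's exit status, 0 for spam and 1 for nonspam.  Add -r to
score with the Robinson method, as in your registration runs:

$ formail -s sh -c 'bogofilter >/dev/null || echo FN' \
    < f-test.spam | grep -c FN    # spams that slipped through
$ formail -s sh -c 'bogofilter >/dev/null && echo FP' \
    < f-test.ham | grep -c FP     # nonspams flagged as spam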

> The Robinson algorithm seems to be safer,
> categorizing fewer emails in general to be spam, but it also fails to
> catch a lot of the spam.  The Graham algorithm seems to be more
> aggressive at catching spam, but it also catches lots of non-spam.  So
> there's no clear winner here.

Nothing to do with the _algorithm_ in either case; for Robinson you
need to tune SPAM_CUTOFF and perhaps ROBS and ROBX (the default value
of 0.2 for the latter is probably a bit low; I use 0.415), while for
Graham, SPAM_CUTOFF clearly needs bumping up a bit for your data.
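
In builds that read a bogofilter.cf, those knobs become lowercase
config options; the fragment below is a sketch only (the option names
and the cutoff value are assumptions; check the documentation for your
version):

# bogofilter.cf fragment (names and availability vary by version)
spam_cutoff=0.56    # hypothetical value; pick whatever level gives
                    # your target false-positive rate
robx=0.415          # Robinson's x; the 0.2 default is probably low
#robs=...           # Robinson's s, if you want to experiment with it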

> Is my training set perhaps too small, despite representing some six
> months of email collection?  Or could it be that the type of email I
> receive is problematic?

Could be the latter, but I definitely suspect the former.

-- 
| G r e g  L o u i s          | gpg public key:      |
|   http://www.bgl.nu/~glouis |   finger greg at bgl.nu |



