some test results

Thu Feb 13 16:01:06 CET 2003

Greetings,

As you all know, bogofilter recently became able to tag header lines, 
specifically Subject: lines.  I wanted to know how much that additional 
information helps.  While I was at it, I also decided to test the options 
replace_non_ascii and block_on_subnets.  Here's the info:

training set    3540 spam,  14026 ham (collected in Oct-Dec, 2002)
test set        1744 spam,   5045 ham (collected in Jan, 2003)

options tested:

         def - default config
         asc - replace_non_ascii=yes
         net - block_on_subnets=yes
         tag - tag_header_lines=yes

For each option (or combination of options) tested, parmtest.sh builds 
wordlists using those options.  Here are their sizes:

                 spamlist        goodlist

def:              40,965         120,035

asc:              35,727         119,603
net:              42,306         124,796
tag:              43,239         124,218

asc-tag:          37,899         123,787
asc-net:          37,068         124,364
net-tag:          44,580         128,979

net-tag-asc:      39,240         128,548
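
To make the size changes easier to compare, here is a minimal Python 
sketch that subtracts the default sizes from each row.  The numbers are 
copied from the table above; the names SIZES and delta_vs_default are 
just illustrative and are not part of parmtest.sh or bogofilter.

```python
# Wordlist sizes copied from the table above: (spamlist, goodlist).
SIZES = {
    "def": (40965, 120035),
    "asc": (35727, 119603),
    "net": (42306, 124796),
    "tag": (43239, 124218),
    "asc-tag": (37899, 123787),
    "asc-net": (37068, 124364),
    "net-tag": (44580, 128979),
    "net-tag-asc": (39240, 128548),
}


def delta_vs_default(sizes, baseline="def"):
    """Return {config: (spamlist_delta, goodlist_delta)} relative to baseline."""
    base_spam, base_good = sizes[baseline]
    return {
        name: (spam - base_spam, good - base_good)
        for name, (spam, good) in sizes.items()
    }


if __name__ == "__main__":
    for name, (d_spam, d_good) in delta_vs_default(SIZES).items():
        print(f"{name:12s} {d_spam:+8,d} {d_good:+8,d}")
```

Read this way, replace_non_ascii is the only option that shrinks the 
lists, while block_on_subnets and tag_header_lines both grow them.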

Here are the test results.  For the spam test mailbox, the numbers shown 
are how many messages were evaluated as spam (s-s), as ham (s-h), and as 
unsure (s-u); for the ham test mailbox, how many were evaluated as spam 
(h-s), as ham (h-h), and as unsure (h-u):

                  s-s  s-h  s-u      h-s  h-h  h-u

def             1609    3  133        2 4918  124

asc             1608    3  134        2 4918  124
net             1604    5  136        2 4934  108
tag             1604    3  138        2 4918  124

asc-net         1602    4  139        2 4934  108
asc-tag         1603    3  139        2 4918  124
net-tag         1599    4  142        2 4928  114

net-tag-asc     1597    3  145        2 4928  114
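
The raw counts can be turned into rates with a short Python sketch like 
the one below.  The counts are copied from the table above; the function 
and key names are my own illustration, not bogofilter output.  Each row 
as printed sums to 1745 spam and 5044 ham, one message off from the test 
set sizes quoted earlier, so the sketch uses the row sums as denominators 
rather than assuming either figure.

```python
# Counts copied from the results table above:
# (s-s, s-h, s-u, h-s, h-h, h-u)
RESULTS = {
    "def": (1609, 3, 133, 2, 4918, 124),
    "asc": (1608, 3, 134, 2, 4918, 124),
    "net": (1604, 5, 136, 2, 4934, 108),
    "tag": (1604, 3, 138, 2, 4918, 124),
    "asc-net": (1602, 4, 139, 2, 4934, 108),
    "asc-tag": (1603, 3, 139, 2, 4918, 124),
    "net-tag": (1599, 4, 142, 2, 4928, 114),
    "net-tag-asc": (1597, 3, 145, 2, 4928, 114),
}


def summarize(row):
    """Convert one row of counts into totals and rates."""
    ss, sh, su, hs, hh, hu = row
    n_spam = ss + sh + su
    n_ham = hs + hh + hu
    return {
        "n_spam": n_spam,
        "n_ham": n_ham,
        "spam_caught": ss / n_spam,      # spam evaluated as spam
        "false_negative": sh / n_spam,   # spam evaluated as ham
        "false_positive": hs / n_ham,    # ham evaluated as spam
        "ham_passed": hh / n_ham,        # ham evaluated as ham
        "unsure": (su + hu) / (n_spam + n_ham),
    }


def best_configs(results, column):
    """Return the set of configs with the highest count in the given column."""
    top = max(row[column] for row in results.values())
    return {name for name, row in results.items() if row[column] == top}


if __name__ == "__main__":
    for name, row in RESULTS.items():
        s = summarize(row)
        print(f"{name:12s} caught={s['spam_caught']:.4f} "
              f"passed={s['ham_passed']:.4f} unsure={s['unsure']:.4f}")
    print("best s-s:", sorted(best_configs(RESULTS, 0)))
    print("best h-h:", sorted(best_configs(RESULTS, 4)))
```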

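The surprise in these results is that the default configuration (without 
any of the 3 "helper" features) did the best at detecting spam (1609 
s-s) and that block_on_subnets did the best at recognizing ham (4934 h-h, 
with or without replace_non_ascii).  The number of ham evaluated as spam 
(h-s) was 2 in every run.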

These results are _not_ what I expected.  I expected _each_ of the 
additional features to improve bogofilter's accuracy.  Since the default 
values were used for ROBS, ROBX, SPAM_CUTOFF, and HAM_CUTOFF, it's possible 
that additional experiments that tune those parameters could improve the 
results.  Similarly, adding a train-on-errors step might be valuable.
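
As a rough illustration of why the cutoffs matter, here is a minimal 
Python sketch of a three-way split on a message score, plus a tally in the 
same s-s/s-h/s-u/h-s/h-h/h-u form as the table above.  Everything in it is 
an assumption made for illustration: the cutoff values, the inclusive 
comparisons, and the toy scores are invented and are not bogofilter's 
defaults or real test data.

```python
from collections import Counter


def classify(score, spam_cutoff, ham_cutoff):
    """Three-way verdict from a score in [0, 1].

    Assumption: scores at or above spam_cutoff are spam, scores at or
    below ham_cutoff are ham, and everything in between is unsure.
    """
    if score >= spam_cutoff:
        return "spam"
    if score <= ham_cutoff:
        return "ham"
    return "unsure"


def tally(scored, spam_cutoff, ham_cutoff):
    """Count outcomes for (score, is_spam) pairs using the table's labels."""
    counts = Counter()
    for score, is_spam in scored:
        verdict = classify(score, spam_cutoff, ham_cutoff)
        # First letter: true class; second letter: verdict (s, h, or u).
        counts[("s" if is_spam else "h") + "-" + verdict[0]] += 1
    return counts


# Invented scores, purely to exercise the functions.
TOY = [
    (0.99, True), (0.60, True), (0.05, True),
    (0.97, False), (0.40, False), (0.01, False),
]

if __name__ == "__main__":
    for spam_cutoff in (0.95, 0.50):
        print(spam_cutoff, dict(tally(TOY, spam_cutoff, ham_cutoff=0.10)))
```

Lowering the spam cutoff in this toy example moves a message from s-u to 
s-s, which is the kind of trade-off a cutoff experiment would measure on 
the real mailboxes.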

I'd very much like to see results of comparable tests by others of you.

David