some test results
David Relson
relson at osagesoftware.com
Thu Feb 13 16:01:06 CET 2003
Greetings,
As you all know, bogofilter recently became able to tag header lines,
specifically Subject: lines. I wanted to know how much that additional
information helps. While I was at it, I also decided to test options
replace_non_ascii and block_on_subnets. Here's the info:

  training set   3540 spam, 14026 ham  (collected in Oct-Dec, 2002)
  test set       1744 spam,  5045 ham  (collected in Jan, 2003)

  options tested:
    def - default config
    asc - replace_non_ascii=yes
    net - block_on_subnets=yes
    tag - tag_header_lines=yes

For each option (or combination of options) tested, parmtest.sh builds
wordlists using those options. Here are their sizes:

                 spamlist   goodlist
  def:             40,965    120,035
  asc:             35,727    119,603
  net:             42,306    124,796
  tag:             43,239    124,218
  asc-tag:         37,899    123,787
  asc-net:         37,068    124,364
  net-tag:         44,580    128,979
  net-tag-asc:     39,240    128,548

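To make the effect of each option on wordlist size easier to see, the sizes above can be compared against the default configuration. This is only a small sketch for reading the table; the dictionary and the helper name delta_vs_default are made up for illustration and are not part of parmtest.sh or bogofilter.

```python
# Wordlist sizes (spamlist, goodlist) copied from the table above.
sizes = {
    "def": (40965, 120035),
    "asc": (35727, 119603),
    "net": (42306, 124796),
    "tag": (43239, 124218),
    "asc-tag": (37899, 123787),
    "asc-net": (37068, 124364),
    "net-tag": (44580, 128979),
    "net-tag-asc": (39240, 128548),
}


def delta_vs_default(name):
    """Return (spamlist_delta, goodlist_delta) relative to the default config."""
    base_spam, base_good = sizes["def"]
    spam, good = sizes[name]
    return spam - base_spam, good - base_good


for name in sizes:
    d_spam, d_good = delta_vs_default(name)
    print(f"{name:12s} spamlist {d_spam:+7,d}  goodlist {d_good:+7,d}")
```

A negative number means the option shrank the wordlist, a positive number means it grew.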
Here are the test results. For the spam test mailbox, the numbers shown
are how many messages were evaluated as spam (s-s), as ham (s-h), and as
unsure (s-u). For the ham test mailbox, they are how many messages were
evaluated as spam (h-s), as ham (h-h), and as unsure (h-u):

                 s-s   s-h   s-u   h-s    h-h   h-u
  def           1609     3   133     2   4918   124
  asc           1608     3   134     2   4918   124
  net           1604     5   136     2   4934   108
  tag           1604     3   138     2   4918   124
  asc-net       1602     4   139     2   4934   108
  asc-tag       1603     3   139     2   4918   124
  net-tag       1599     4   142     2   4928   114
  net-tag-asc   1597     3   145     2   4928   114

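The raw counts can be turned into rates to compare configurations side by side. The sketch below is just one way to do that; the names results and rates are illustrative. Note that each row's counts sum to 1745 spam and 5044 ham, so the rates are taken over the row sums rather than over the test set sizes quoted earlier.

```python
# Counts copied from the results table: (s-s, s-h, s-u, h-s, h-h, h-u).
results = {
    "def":         (1609, 3, 133, 2, 4918, 124),
    "asc":         (1608, 3, 134, 2, 4918, 124),
    "net":         (1604, 5, 136, 2, 4934, 108),
    "tag":         (1604, 3, 138, 2, 4918, 124),
    "asc-net":     (1602, 4, 139, 2, 4934, 108),
    "asc-tag":     (1603, 3, 139, 2, 4918, 124),
    "net-tag":     (1599, 4, 142, 2, 4928, 114),
    "net-tag-asc": (1597, 3, 145, 2, 4928, 114),
}


def rates(row):
    """Convert one row of counts into fractions of the spam and ham totals."""
    s_s, s_h, s_u, h_s, h_h, h_u = row
    spam_total = s_s + s_h + s_u
    ham_total = h_s + h_h + h_u
    return {
        "spam_caught": s_s / spam_total,   # spam evaluated as spam
        "spam_missed": s_h / spam_total,   # spam evaluated as ham
        "ham_passed": h_h / ham_total,     # ham evaluated as ham
        "ham_blocked": h_s / ham_total,    # ham evaluated as spam
    }


for name, row in results.items():
    r = rates(row)
    print(f"{name:12s} spam caught {r['spam_caught']:.2%}  "
          f"ham passed {r['ham_passed']:.2%}  "
          f"ham blocked {r['ham_blocked']:.3%}")

# Configurations with the highest s-s and h-h counts.
best_spam = max(results, key=lambda k: results[k][0])
best_ham = max(results, key=lambda k: results[k][4])
```

Since max() returns the first maximum it finds, best_ham names only one of the configurations tied on h-h.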
The surprise in these results is that the default configuration (without
any of the 3 "helper" features) did the best for detecting spam (1609
caught) and that block_on_subnets did the best for recognizing ham (4934
passed, with or without replace_non_ascii).
These results are _not_ what I expected. I expected _each_ of the
additional features to improve bogofilter's accuracy. Since the default
values were used for ROBS, ROBX, SPAM_CUTOFF, and HAM_CUTOFF, it's possible
that additional experiments could improve the results. Similarly, adding a
train-on-errors step might be valuable.
I'd very much like to see results of comparable tests by others of you.
David