subject-tagging test results

Greg Louis glouis at dynamicro.on.ca
Fri Feb 14 14:06:39 CET 2003


This will be written up later today or tomorrow on
http://www.bgl.nu/bogofilter, for those who are interested in the
details.

I ran a test with 21,969 spams and 19,823 nonspams in the training
corpus; the test involved two runs, each with 5,948 spams and 3,204
nonspams.  Two factors were varied: David's new header tagging feature
(off or on), and min_dev, from 0.025 to 0.125 in steps of 0.025.  At
each combination, the nonspams were used to establish a spam-cutoff
value that would give 0.1% false positives, and then the number of
false negatives at that cutoff was determined.
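
For anyone who wants the cutoff procedure spelled out, here is a
minimal Python sketch.  It is not the script used for this test: the
function names are made up for illustration, and it assumes a message
is classified as spam when its score is greater than or equal to the
cutoff.

```python
import math

def cutoff_for_fp_rate(nonspam_scores, target_fp_rate):
    """Lowest observed nonspam score usable as a cutoff without exceeding
    the target false-positive rate (assumes score >= cutoff means spam)."""
    # Number of nonspams we may misclassify; the epsilon guards against
    # float rounding when rate * count is meant to be a whole number.
    allowed = math.floor(target_fp_rate * len(nonspam_scores) + 1e-9)
    ordered = sorted(nonspam_scores, reverse=True)
    best = math.inf  # inf means: no observed score is a safe cutoff
    i = 0
    while i < len(ordered):
        # Step over a run of tied scores; afterwards j is the number of
        # nonspams scoring >= ordered[i].
        j = i
        while j < len(ordered) and ordered[j] == ordered[i]:
            j += 1
        if j > allowed:
            break
        best = ordered[i]
        i = j
    return best

def count_false_negatives(spam_scores, cutoff):
    """Spams scoring below the cutoff are missed."""
    return sum(1 for score in spam_scores if score < cutoff)
```

With 3,204 nonspams and a 0.1% target, this allows at most 3 nonspams
at or above the cutoff.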

Without header tagging, the optimum min_dev value was 0.075, and at
that value there were 3.9% false negatives (233 per run, on average).
With header tagging, the optimum min_dev was 0.05, with 3.5% false
negatives (214 per run).  At low min_dev values (0.025 and 0.05),
tagging was beneficial; at 0.075 it was neither beneficial nor harmful;
at 0.1 and 0.125 tagging yielded more (though not many more) false
negatives.

The min_dev value of 0.125 gave the worst results, both without and
with tagging (4.9% and 5.0% respectively).  I mention this to underline
the point that, although _optimum_ performance depends on appropriate
tuning of bogofilter's various parameters, it's not necessary to get it
_exactly_right_ to enjoy the benefit of spam filtering.

Speaking of parameters, Robinson's x in this experiment was 0.415 and s
was 5.0e-7.  The spam cutoffs determined from the nonspam test corpora
were quite high, ranging from 0.991 to 0.998; that reflects the fact
that the nonspam corpora included quite a few spammy-looking
newsletters, so the high cutoffs are needed to avoid false positives. 
Even so, we caught up to 96.5% of spam -- not that bad a result.
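
To show what x and s do, here is a minimal Python sketch of Robinson's
smoothed token estimate, f(w) = (s*x + n*p) / (s + n), where n is the
number of times the token has been seen and p is its raw spam
probability.  The function names are hypothetical and this is not
bogofilter's own code; the min_dev check assumes that tokens whose
f(w) lies within min_dev of 0.5 are left out of the calculation.

```python
def robinson_f(p, n, x=0.415, s=5.0e-7):
    """Robinson's smoothed estimate for one token.

    p: raw spam probability of the token
    n: number of times the token has been seen in training
    x: value assumed for a token never seen before
    s: weight given to x relative to the observed data
    """
    return (s * x + n * p) / (s + n)

def is_significant(f, min_dev):
    """Assumed min_dev rule: keep a token only if its estimate is at
    least min_dev away from the neutral value 0.5."""
    return abs(f - 0.5) >= min_dev
```

With s as small as 5.0e-7, a single sighting of a token almost
entirely overrides x, while a token never seen before scores x itself.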

-- 
| G r e g  L o u i s          | gpg public key:      |
|   http://www.bgl.nu/~glouis |   finger greg at bgl.nu |
| Help free our mailboxes. Include                   |
|        http://wecanstopspam.org in your signature. |



