subject-tagging test results
Greg Louis
glouis at dynamicro.on.ca
Fri Feb 14 14:06:39 CET 2003
This will be written up later today or tomorrow on
http://www.bgl.nu/bogofilter, for those who are interested in the
details.
I ran a test with 21,969 spams and 19,823 nonspams in the training
corpus; the test involved two runs, each with 5,948 spams and 3,204
nonspams. Two factors were varied: David's new header-tagging feature
(on or off), and min_dev, from 0.025 to 0.125 in steps of 0.025. For
each combination, the nonspams were used to establish a spam-cutoff
value that would give 0.1% false positives, and then the number of
false negatives at that cutoff was determined.
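For anyone who wants to repeat the cutoff step on their own corpora, here is a rough sketch of the procedure just described. It is not the script used for this test; the function names are made up for illustration, and it assumes a message counts as spam when its score is strictly above the cutoff.

```python
def cutoff_for_fp_rate(nonspam_scores, fp_rate=0.001):
    """Pick a cutoff so that at most fp_rate of the nonspams score above it.

    Hypothetical helper: assumes "spam" means score > cutoff (strictly).
    """
    if not nonspam_scores:
        raise ValueError("need at least one nonspam score")
    ranked = sorted(nonspam_scores, reverse=True)
    # Number of nonspams we are willing to misclassify (e.g. 3 of 3,204).
    allowed = int(len(ranked) * fp_rate)
    if allowed >= len(ranked):
        return 0.0
    # Everything strictly above this score is at most `allowed` messages.
    return ranked[allowed]


def count_false_positives(nonspam_scores, cutoff):
    """Nonspams that would be called spam at this cutoff."""
    return sum(1 for score in nonspam_scores if score > cutoff)


def count_false_negatives(spam_scores, cutoff):
    """Spams that would slip through at this cutoff."""
    return sum(1 for score in spam_scores if score <= cutoff)
```

One would call cutoff_for_fp_rate() on the nonspam scores from a run, then count_false_negatives() on the spam scores from the same run, once per min_dev and tagging setting.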
Without header tagging, the optimum min_dev value was 0.075, and at
that value there were 3.9% false negatives (233 per run, on average).
With header tagging, the optimum min_dev was 0.05, with 3.5% false
negatives (214 per run). At low min_dev values (0.025 and 0.05),
tagging was beneficial; at 0.075 it was neither beneficial nor harmful;
at 0.1 and 0.125 tagging yielded more (though not many more) false
negatives.
The min_dev value of 0.125 gave the worst results, both without and
with tagging (4.9 and 5.0% respectively). I mention this to underline
the point that, although _optimum_ performance depends on appropriate
tuning of bogofilter's various parameters, it's not necessary to get it
_exactly_right_ to enjoy the benefit of spam filtering.
Speaking of parameters, Robinson's x in this experiment was 0.415 and s
was 5.0e-7. The spam cutoffs determined from the nonspam test corpora
were quite high, ranging from 0.991 to 0.998; that reflects the fact
that the nonspam corpora included quite a few spammy-looking
newsletters, so the high cutoffs are needed to avoid false positives.
Even so, we caught up to 96.5% of spam -- not that bad a result.
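To make the roles of x, s and min_dev concrete, here is a small sketch of how they enter Robinson's calculation as I understand it: f(w) = (s*x + n*p(w)) / (s + n), where n is how often the token was seen and p(w) is its raw spam probability, and tokens whose f(w) lies within min_dev of 0.5 are left out of the score. The function names are illustrative only, not bogofilter's, and treating a deviation exactly equal to min_dev as "kept" is an assumption.

```python
def robinson_f(p, n, x=0.415, s=5.0e-7):
    """Robinson's smoothed token probability.

    p: raw spam probability of the token, n: times the token was seen,
    x: value assumed for an unseen token, s: weight given to x.
    """
    return (s * x + n * p) / (s + n)


def significant_tokens(f_values, min_dev=0.05):
    """Keep only tokens at least min_dev away from the neutral 0.5.

    Assumption: a deviation exactly equal to min_dev is kept.
    """
    return [f for f in f_values if abs(f - 0.5) >= min_dev]
```

With s as small as 5.0e-7, f(w) is essentially p(w) for any token seen even once, and x only matters for tokens never seen in training.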
--
| G r e g L o u i s | gpg public key: |
| http://www.bgl.nu/~glouis | finger greg at bgl.nu |
| Help free our mailboxes. Include |
| http://wecanstopspam.org in your signature. |