subject tagging in production
Greg Louis
glouis at dynamicro.on.ca
Mon Feb 17 18:29:44 CET 2003
Some encouraging figures:
The email I've received between Feb. 7 and this morning in my personal
mailbox has been classified with bogofilter-0.10.2.cvs.20030213, making
use of the new subject-header tagging capability. The training
database was first rebuilt with subject-header tagging as well, of
course. The results are rather nice: 5,474 correctly classified
nonspam; 222 correctly classified spam; 184 nonspam classified as
uncertain; and 2 spam classified as uncertain. I deliver email classed
as uncertain, so in binary terms, these results represent 0 false
positives and 2 false negatives in 5,882 emails of which 224 were spam.
Two spam delivered out of 224 is 0.89%, so there's still room for
improvement, but on the other hand, catching 99.11% of spam is a nice
achievement. If I wanted to hype bogofilter, I could truthfully say
that, in this run, it made only two mistakes (defining a mistake as
either delivering a spam or quarantining a nonspam) in classifying
5,882 emails -- 0.034% error!
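For anyone who wants to check the arithmetic, here is a minimal Python sketch that recomputes the percentages from the four counts given above. Only the counts come from this message; the variable names are illustrative, and the sketch assumes (as stated) that mail classed as uncertain is delivered.

```python
# Counts reported in the message above.
ham_ok = 5474       # nonspam classified as nonspam
spam_ok = 222       # spam classified as spam
ham_unsure = 184    # nonspam classified as uncertain (delivered)
spam_unsure = 2     # spam classified as uncertain (delivered)

total = ham_ok + spam_ok + ham_unsure + spam_unsure
total_spam = spam_ok + spam_unsure

# Uncertain mail is delivered, so an uncertain spam counts as a false
# negative, while an uncertain nonspam is still handled correctly.
false_negatives = spam_unsure
false_positives = 0

fn_rate = false_negatives / total_spam
error_rate = (false_positives + false_negatives) / total

print(f"total emails:   {total}")
print(f"total spam:     {total_spam}")
print(f"spam delivered: {fn_rate:.2%}")
print(f"spam caught:    {1 - fn_rate:.2%}")
print(f"overall error:  {error_rate:.3%}")
```

Note that the overall error figure looks so small mainly because spam is under 4% of the mail in this run; the 0.89% miss rate among spam is the more demanding measure.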
The parameters with which this classification was run are those that
seemed to give the best results in my early testing (see
http://www.bgl.nu/bogofilter/subjtag.html for details), namely:
Robinson's s = 5.0e-7
Robinson's x = 0.415
Spam cutoff = 0.98
Nonspam cutoff = 0.3
Minimum deviation = 0.05
(But David gets best results on his email corpora with a higher minimum
deviation and a lower spam cutoff; ymmv -- in fact, ym almost certainly
_will_ v ;)
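As a rough illustration of how these five parameters interact, here is a minimal Python sketch. It is not bogofilter's code: the constant and function names are hypothetical, the formula is Robinson's published f(w) = (s*x + n*p(w)) / (s + n) as I understand it, and the treatment of values exactly at a cutoff or exactly at the minimum deviation is an assumption.

```python
# Parameter values quoted from the message; the names are hypothetical.
ROBS = 5.0e-7          # Robinson's s: weight given to the prior x
ROBX = 0.415           # Robinson's x: score assumed for an unseen token
SPAM_CUTOFF = 0.98
NONSPAM_CUTOFF = 0.3
MIN_DEV = 0.05


def token_score(p_w, n):
    """Robinson's f(w): blend the raw estimate p_w, seen n times, with x."""
    return (ROBS * ROBX + n * p_w) / (ROBS + n)


def is_used(f_w):
    """Keep a token only if its score is at least MIN_DEV away from 0.5."""
    return abs(f_w - 0.5) >= MIN_DEV


def classify(score):
    """Three-way decision on a message score (boundaries are an assumption)."""
    if score >= SPAM_CUTOFF:
        return "spam"
    if score <= NONSPAM_CUTOFF:
        return "nonspam"
    return "uncertain"


if __name__ == "__main__":
    print(token_score(0.9, 0))    # unseen token falls back to x
    print(token_score(0.9, 10))   # with s this small, data dominates quickly
    print(is_used(0.52), classify(0.5))
```

With s as small as 5.0e-7, a single observation of a token is enough to swamp the prior almost completely, so x matters only for tokens the training database has never seen.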
--
| G r e g L o u i s | gpg public key: |
| http://www.bgl.nu/~glouis | finger greg at bgl.nu |
| Help free our mailboxes. Include |
| http://wecanstopspam.org in your signature. |