subject tagging in production

Greg Louis glouis at dynamicro.on.ca
Mon Feb 17 18:29:44 CET 2003


Some encouraging figures:

The email I've received between Feb. 7 and this morning in my personal
mailbox has been classified with bogofilter-0.10.2.cvs.20030213, making
use of the new subject-header tagging capability.  The training
database was first rebuilt with subject-header tagging as well, of
course.  The results are rather nice: 5,474 correctly classified
nonspam; 222 correctly classified spam; 184 nonspam classified as
uncertain; and 2 spam classified as uncertain.  I deliver email classed
as uncertain, so in binary terms, these results represent 0 false
positives and 2 false negatives in 5,882 emails, of which 224 were spam.

Two spam delivered out of 224 is 0.89%, so there's still room for
improvement, but on the other hand, catching 99.11% of spam is a nice
achievement.  If I wanted to hype bogofilter, I could truthfully say
that, in this run, it made only two mistakes (defining a mistake as
either delivering a spam or quarantining a nonspam) in classifying
5,882 emails -- 0.034% error!
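
For anyone who wants to check the arithmetic, here is a small sketch that recomputes the percentages from the four counts reported above. The variable names are illustrative only and are not part of bogofilter.

```python
# Counts reported in the message above.
ham_ok = 5474        # nonspam classified as nonspam
spam_ok = 222        # spam classified as spam
ham_unsure = 184     # nonspam classified as uncertain
spam_unsure = 2      # spam classified as uncertain

total = ham_ok + spam_ok + ham_unsure + spam_unsure
total_spam = spam_ok + spam_unsure

# Uncertain mail is delivered, so an uncertain spam counts as a false
# negative, while an uncertain nonspam is still a correct delivery.
false_negatives = spam_unsure
false_positives = 0  # no nonspam was classified as spam

miss_rate = 100.0 * false_negatives / total_spam
catch_rate = 100.0 * spam_ok / total_spam
error_rate = 100.0 * (false_negatives + false_positives) / total

print(f"total={total} spam={total_spam}")
print(f"missed {miss_rate:.2f}%, caught {catch_rate:.2f}%, "
      f"overall error {error_rate:.3f}%")
```

The totals come out to 5,882 messages and 224 spam, matching the figures quoted in the text.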

The parameters with which this classification was run are those that
seemed to give the best results in my early testing (see
http://www.bgl.nu/bogofilter/subjtag.html for details), namely:

Robinson's s      = 5.0e-7
Robinson's x      = 0.415
Spam cutoff       = 0.98
Nonspam cutoff    = 0.3
Minimum deviation = 0.05

(But David gets best results on his email corpora with a higher minimum
deviation and a lower spam cutoff; ymmv -- in fact, ym almost certainly
_will_ v ;)
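
As a rough illustration of how the parameters above fit together, here is a minimal sketch. It is not bogofilter's source code. It assumes Robinson's published smoothing formula f(w) = (s*x + n*p(w)) / (s + n), that tokens whose f(w) lies within the minimum deviation of 0.5 are ignored, and that the cutoff comparisons are inclusive. The function names are hypothetical, and the step that combines token scores into a message score is left out.

```python
# Parameter values from the list above.
S = 5.0e-7            # Robinson's s: weight given to the prior x
X = 0.415             # Robinson's x: assumed score for an unseen token
SPAM_CUTOFF = 0.98
NONSPAM_CUTOFF = 0.3
MIN_DEV = 0.05


def robinson_f(p, n, s=S, x=X):
    """Smoothed token score; p is the raw spam estimate, n the token count.

    With n == 0 the result is x; as n grows it approaches p.
    """
    return (s * x + n * p) / (s + n)


def is_significant(f, min_dev=MIN_DEV):
    """Assumed rule: keep only tokens at least min_dev away from 0.5."""
    return abs(f - 0.5) >= min_dev


def classify(score):
    """Three-way decision on a message score in [0, 1].

    Inclusive comparisons are an assumption of this sketch.
    """
    if score >= SPAM_CUTOFF:
        return "spam"
    if score <= NONSPAM_CUTOFF:
        return "nonspam"
    return "uncertain"


if __name__ == "__main__":
    for score in (0.99, 0.5, 0.1):
        print(score, classify(score))
```

With s as small as 5.0e-7, a token seen even once gets an f(w) almost equal to its raw estimate p(w), so x matters mainly for tokens that have never been seen.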

-- 
| G r e g  L o u i s          | gpg public key:      |
|   http://www.bgl.nu/~glouis |   finger greg at bgl.nu |
| Help free our mailboxes. Include                   |
|        http://wecanstopspam.org in your signature. |



