New vs Old

David Relson relson at osagesoftware.com
Thu Mar 25 13:50:09 CET 2004


Greg,

Here's a table with bogofilter's scoring parameters:

                cur     new
robs            0.010   0.0178
robx            0.415   0.52
min_dev         0.1     0.375
spam_cutoff     0.95    0.99
ham_cutoff      0.00    0.00    (bi-state)
ham_cutoff      0.10    0.45    (tri-state)

Three of the differences stand out, and I'm wondering about them.

First, robx is changing from slightly hammish to slightly spammish.  Our
traditional preference for false negatives (rather than false positives)
has led us to prefer a hammish value.
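
To make the robx change concrete, here is a minimal sketch.  It assumes
the usual Robinson formula f(w) = (s*x + n*p) / (s + n), with s = robs
and x = robx, where p is a token's raw spam probability and n is how
often it has been seen.  The function name robinson_f is made up for
illustration and is not bogofilter code.

```python
def robinson_f(p, n, robs, robx):
    """Blend a token's raw spam probability p (seen n times) with the prior robx.

    Assumption: Robinson's formula, with robs acting as the prior's weight.
    """
    return (robs * robx + n * p) / (robs + n)


# A token that has never been seen (n = 0) scores robx itself.
cur_unknown = robinson_f(p=0.5, n=0, robs=0.010, robx=0.415)
new_unknown = robinson_f(p=0.5, n=0, robs=0.0178, robx=0.52)

# A token seen once, only in spam (p = 1.0), is pulled slightly toward robx.
cur_rare = robinson_f(p=1.0, n=1, robs=0.010, robx=0.415)
new_rare = robinson_f(p=1.0, n=1, robs=0.0178, robx=0.52)

print(cur_unknown, new_unknown)
print(cur_rare, new_rare)
```

Under this assumption an unknown token now starts just above 0.5 rather
than just below it, which is the hammish-to-spammish shift.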

Second, min_dev has increased significantly.  This can be thought of as
changing scoring from ignoring only the neutral tokens to using only the
extrema.
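
A small sketch of that effect, assuming min_dev means "ignore any token
whose score is within min_dev of 0.5".  The token scores are invented,
the name significant_tokens is hypothetical, and the handling of a
score exactly at the boundary is my assumption.

```python
def significant_tokens(scores, min_dev):
    """Keep only token scores at least min_dev away from the neutral 0.5."""
    return [f for f in scores if abs(f - 0.5) >= min_dev]


# Invented token scores for one message.
scores = [0.02, 0.20, 0.45, 0.55, 0.70, 0.93, 0.99]

kept_cur = significant_tokens(scores, 0.1)    # current min_dev
kept_new = significant_tokens(scores, 0.375)  # new min_dev

print(kept_cur)
print(kept_new)
```

With min_dev at 0.375, only scores below 0.125 or above 0.875 survive,
so far fewer tokens take part in the final score.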

Third, the increased spam_cutoff seems more conservative, i.e. a message
has to score really high to be labeled spam.
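
Here is a sketch of how the cutoffs turn a score into a label.  It
assumes a message is spam at or above spam_cutoff, ham at or below
ham_cutoff, and unsure in between, and that a ham_cutoff of 0.00 means
bi-state mode (everything that isn't spam is ham).  The function name
classify is made up for illustration.

```python
def classify(score, spam_cutoff, ham_cutoff):
    """Map a message score to 'spam', 'ham', or 'unsure' (tri-state only)."""
    if score >= spam_cutoff:
        return "spam"
    if ham_cutoff == 0.0:
        # Assumed bi-state behaviour: no 'unsure' band.
        return "ham"
    return "ham" if score <= ham_cutoff else "unsure"


# The same score of 0.97 under the current and the new tri-state settings.
cur_label = classify(0.97, spam_cutoff=0.95, ham_cutoff=0.10)
new_label = classify(0.97, spam_cutoff=0.99, ham_cutoff=0.45)

# Under the new bi-state settings the same message is simply ham.
new_bistate_label = classify(0.97, spam_cutoff=0.99, ham_cutoff=0.00)

print(cur_label, new_label, new_bistate_label)
```

So a message scoring 0.97 is spam under the current cutoff but falls
short of the new one.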

There are some big differences between this tuning run and the original
effort in 2002.  We've got lots more experience and we know more.  In
addition, we have collected a large test corpus.  Lastly, we have
bogotune -- our tool for searching out good parameter values.

Considering all this, and the changes in min_dev and spam_cutoff (in
particular), I'm wondering if bogofilter might need different parameters
for large and small wordlists.  

Unfortunately, I don't know how to do a meaningful test with small
wordlists.

David

-- 
David Relson                   Osage Software Systems, Inc.
relson at osagesoftware.com       Ann Arbor, MI 48103
www.osagesoftware.com          tel:  734.821.8800



