New version

Greg Louis glouis at dynamicro.on.ca
Tue Mar 16 13:46:04 CET 2004


On 20040316 (Tue) at 0628:15 -0600, Bill McClain wrote:
> On Tue, 16 Mar 2004 06:54:48 -0500
> Tom Allison <tallison at tacocat.net> wrote:
> 
> > Perhaps something to the effect of:
> > min_dev: you want to make sure your robx value is within the range of 
> > 0.50 +/- min_dev.  If you don't, then all your new tokens will be 
> > included in calculations and have strange effects on your scores.
> 
> I don't think that's generally true.

I agree.  There are situations -- a small training database is one of
them -- where it makes sense not to consider unknowns, and making sure
min_dev excludes them is a way to avoid swamping your valid tokens with
priors.  But some of us find that a very small min_dev works really
well (bogotune will tell you, if you have enough messages to run it).
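
To make the interaction between robx and min_dev concrete, here is a
minimal sketch. It assumes Robinson's smoothing formula
f(w) = (robs * robx + n * p(w)) / (robs + n), where n is the number of
times the token has been seen, and it assumes a token is scored only when
f(w) lies at least min_dev away from 0.5. The function names are made up
for illustration and the exact boundary handling is a guess; this is not
bogofilter's actual code.

```python
def robinson_fw(p_w, n, robx, robs):
    """Smoothed token score: pulls p(w) toward robx when n is small.

    For a token never seen before (n == 0) the result is simply robx.
    """
    return (robs * robx + n * p_w) / (robs + n)


def is_scored(fw, min_dev):
    """True if the token is far enough from 0.5 to take part in scoring.

    Treating the boundary as inclusive (>=) is an assumption.
    """
    return abs(fw - 0.5) >= min_dev


# Unknown tokens (n == 0) under three robx settings, all with min_dev = 0.02.
for robx, robs in [(0.4, 0.01), (0.6106, 0.0178), (0.5, 0.01)]:
    fw = robinson_fw(p_w=0.0, n=0, robx=robx, robs=robs)
    print(f"robx={robx}: unknown token f(w)={fw:.4f}, "
          f"scored={is_scored(fw, 0.02)}")
```

With robx at 0.4 or 0.6106 an unknown token falls outside 0.5 +/- 0.02 and
is counted; with robx at 0.5 it falls inside the band and is ignored.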

> I'm currently using:
> 
>     robx=0.400000
>     min_dev=0.020
>     robs=0.0100
>     spam_cutoff=0.282       
>     ham_cutoff=0.043        
> 
> ...and getting 1% false negative, 0.1% false positive. A narrow range of
> equivocal tokens (0.48-0.52) is not checked when scoring, and
> previously unseen tokens are scored 0.4, which is still spammish using
> my cutoffs.

robx        = 0.610600 (6.11e-01)
robs        = 0.017800 (1.78e-02)
min_dev     = 0.020000 (2.00e-02)
ham_cutoff  = 0.281000 (2.81e-01)
spam_cutoff = 0.532200 (5.32e-01)

gives me 1.1% fn and I haven't had an fp in 8 weeks now (150,000-odd
messages).  Same basic setup: a somewhat spammy robx and minimal
minimum deviation.  Unknowns bias the scoring spamward, which is ok,
because -- especially these days -- spams do contain more unknowns. 
Ok, at least, if you have enough registered nonspam to balance out that
bias with strong nonspam values.  For example, the training db to which
the above parameter values apply has:
                                 spam   good
.MSG_COUNT                      23433  21659
and there are 784,377 tokens in that wordlist, 275,413 of which appear
in nonspam.
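
As a rough illustration of why the same number can mean different things
under the two sets of cutoffs quoted above, the sketch below classifies a
message score three ways. It assumes a message is spam when its score is
at or above spam_cutoff, nonspam when at or below ham_cutoff, and unsure
in between; the function name and the boundary handling are assumptions,
and how the message score is combined from token scores is left out.

```python
def classify(score, ham_cutoff, spam_cutoff):
    """Three-way verdict; inclusive boundaries (>= and <=) are an assumption."""
    if score >= spam_cutoff:
        return "spam"
    if score <= ham_cutoff:
        return "ham"
    return "unsure"


# Cutoffs taken from the two configurations quoted in this message.
settings = {
    "Bill": {"ham_cutoff": 0.043, "spam_cutoff": 0.282},
    "Greg": {"ham_cutoff": 0.281, "spam_cutoff": 0.5322},
}

for name, cutoffs in settings.items():
    print(name, classify(0.4, **cutoffs))
```

A message score of 0.4 lands in the spam range with Bill's cutoffs but in
the unsure range with mine, so a parameter value only makes sense together
with the rest of the set.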

I don't know exactly what we should say in the FAQ -- probably just
"don't play with the default parameters (except spam_cutoff) till you
have enough messages to run bogotune, or enough experience to know what
you're doing, or enough curiosity to enjoy trial and error :)"

-- 
| G r e g  L o u i s         | gpg public key: 0x400B1AA86D9E3E64 |
|  http://www.bgl.nu/~glouis |   (on my website or any keyserver) |
|  http://wecanstopspam.org in signatures helps fight junk email. |



