When is spam_cutoff too low?
Tom Anderson
tanderso at oac-design.com
Wed Dec 8 15:18:59 CET 2004
From: "Matej Cepl" <cepl at surfbest.net>
> waited for couple of days, what happens). However, now I have there 0.87
> and still plenty of false negatives (no false positive so far) and I am
> getting afraid, when I will hit the magic limit, when the bogofilter begin
> massively misclassify ham as spam. Is there such a limit? Should I do
> something else than tuning down spam_cutoff?
Look at all of your ham and find the highest scoring one over the past 1-3
months. You can set your spam cutoff to just above that value and not fear
getting false positives. It's still possible of course, but highly
unlikely. I find that by using -u to register all of my hams automatically,
my highest ham score is around 0.01.
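As a rough sketch of that rule of thumb: assuming you have already collected the scores of your recent ham as plain floats (how you extract them is up to you), the cutoff is the maximum plus a small safety margin. The function name and the `margin` parameter are made up for illustration; they are not part of bogofilter.

```python
def suggest_spam_cutoff(ham_scores, margin=0.01):
    """Return a spam_cutoff just above the highest-scoring ham.

    ham_scores: bogofilter scores (0.0 to 1.0) of known-good mail from
                the past 1-3 months. Collecting them is left to the reader.
    margin:     hypothetical safety gap above the worst ham score.
    """
    if not ham_scores:
        raise ValueError("need at least one ham score")
    # Never suggest a cutoff above the maximum possible score of 1.0.
    return min(1.0, max(ham_scores) + margin)
```

If your highest ham score is 0.30, for example, a margin of 0.05 suggests a cutoff of about 0.35.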
The other magic numbers to look out for are robx and min_dev. Min_dev
should be sufficiently large so that the first few times a token is seen, it
is not used to classify yet, and so that tokens in lots of ham and spam
alike are not used to classify. My min_dev is 0.2 which means that tokens
scoring between 0.3 and 0.7 do not contribute to email classifications since
they are not polarized enough. Robx is the score a token gets when it has
never been seen before. It should probably sit just south of 0.5 and within
your min_dev range (0.5 +/- min_dev) so that you don't use a token to
classify an email the first time you see that token. Your spam_cutoff should
not go below robx...
if it did, then in the odd circumstance of receiving an email with zero
known tokens, it'd be classified as spam when in actuality, it should be
unsure. This is because bogofilter would score such an email equal to robx.
Perhaps that should change, and bogofilter should score an email with zero
known tokens as (spam_cutoff + ham_cutoff)/2 in order to ensure it being
exactly unsure, but unless that changes, spam_cutoff should remain above
robx which should remain above 0.5-min_dev.
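To make the reasoning above concrete, here is a small sketch, not bogofilter's actual code. It assumes a token counts only when it is at least min_dev away from 0.5, and that a message is spam at or above spam_cutoff and ham at or below ham_cutoff. The function names and the exact boundary handling are my own assumptions.

```python
def is_significant(token_score, min_dev):
    """True if the token is polarized enough to contribute to classification."""
    return abs(token_score - 0.5) >= min_dev


def classify(message_score, ham_cutoff, spam_cutoff):
    """Three-way classification of a message score (assumed boundary rules)."""
    if message_score >= spam_cutoff:
        return "spam"
    if message_score <= ham_cutoff:
        return "ham"
    return "unsure"
```

With min_dev=0.2, a never-before-seen token scoring robx=0.46 is not significant, so it is ignored. A message with zero known tokens scores robx, so `classify(0.46, 0.1, 0.465)` comes out unsure, while dropping spam_cutoff to 0.45 would turn the same message into spam.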
Here are my values:
robx=0.46, robs=0.2, min_dev=0.2, spam_cutoff=0.465, ham_cutoff=0.1,
thresh_update=0.01
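A quick way to sanity-check a set of values against the two rules described earlier (robx inside 0.5 +/- min_dev, and spam_cutoff above robx) might look like this. It is only a sketch; `check_settings` is a hypothetical helper, not something bogofilter provides.

```python
def check_settings(robx, min_dev, spam_cutoff):
    """Return a list of rule violations; an empty list means the values pass."""
    problems = []
    # robx must fall inside the band that min_dev excludes from scoring,
    # so a never-seen token does not influence the result.
    if not (0.5 - min_dev < robx < 0.5 + min_dev):
        problems.append("robx is outside 0.5 +/- min_dev")
    # A message with zero known tokens scores robx; it should stay unsure.
    if not spam_cutoff > robx:
        problems.append("spam_cutoff is not above robx")
    return problems
```

For the values above, `check_settings(0.46, 0.2, 0.465)` returns an empty list.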
Recently, I've been getting zero false positives, about 1 false negative
every 2-3 days, and about 5 unsures per day. I could probably lower my
spam_cutoff some more in order to reduce the unsures since I never see ham
in there. Correctly filtered emails were about 100-200 per day until I
installed some DNSBLs, and now they're down to about 30-40 per day which
makes it easier to occasionally scan for false positives.
Tom
P.S. In fact, I've just decided to lower my spam_cutoff:
robx=0.41, robs=0.2, min_dev=0.2, spam_cutoff=0.42, ham_cutoff=0.1,
thresh_update=0.01
I lowered my robx so that it was still below the spam_cutoff but still
above 0.5-min_dev.