When is spam_cutoff too low?

Tom Anderson tanderso at oac-design.com
Wed Dec 8 15:18:59 CET 2004


From: "Matej Cepl" <cepl at surfbest.net>
> waited for couple of days, what happens). However, now I have there 0.87
> and still plenty of false negatives (no false positive so far) and I am
> getting afraid, when I will hit the magic limit, when the bogofilter begin
> massively misclassify ham as spam. Is there such a limit? Should I do
> something else than tuning down spam_cutoff?

Look at all of your ham and find the highest scoring one over the past 1-3 
months.  You can set your spam cutoff to just above that value and not fear 
getting false positives.  False positives are still possible, of course, but 
highly unlikely.  I find that by using -u to register all of my hams 
automatically, my highest ham score is around 0.01.
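
As a rough sketch of that rule of thumb in Python: the function name, the 
margin, and the sample scores below are all made up for illustration, and 
the scores are assumed to have been collected from your own ham already.

```python
def suggest_spam_cutoff(ham_scores, margin=0.01):
    """Return a cutoff slightly above the highest ham score seen.

    `margin` is an arbitrary safety gap chosen for this example.
    """
    if not ham_scores:
        raise ValueError("need at least one ham score")
    return max(ham_scores) + margin


# Hypothetical scores of ham received over the past 1-3 months.
ham_scores = [0.0001, 0.003, 0.01, 0.0007]
cutoff = suggest_spam_cutoff(ham_scores)
print(f"highest ham score: {max(ham_scores)}, suggested spam_cutoff: {cutoff:.3f}")
```

The longer the period the scores cover, the less likely a future ham is to 
land above the suggested value.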

The other magic numbers to look out for are robx and min_dev.  min_dev 
should be large enough that a token is not used to classify the first few 
times it is seen, and that tokens appearing in lots of ham and spam alike 
are not used at all.  My min_dev is 0.2, which means that tokens scoring 
between 0.3 and 0.7 do not contribute to email classifications since they 
are not polarized enough.

Robx is the score a token gets when it has never been seen before.  It 
should probably be somewhere just south of 0.5 and within your min_dev 
range (0.5 +/- min_dev), so that you don't use a token to classify an email 
the first time you see that token.

Your spam_cutoff should not go below robx.  If it did, then in the odd 
circumstance of receiving an email with zero known tokens, it'd be 
classified as spam when it should really be unsure, because bogofilter 
would score such an email equal to robx.  Perhaps that should change, and 
bogofilter should score an email with zero known tokens as 
(spam_cutoff + ham_cutoff)/2 to ensure it is exactly unsure.  Unless that 
changes, spam_cutoff should remain above robx, which should remain above 
0.5-min_dev.
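
To make the interaction concrete, here is a small Python sketch of the two 
rules described above.  It is not bogofilter's code: the function names are 
invented, and the exact behaviour at the boundaries (>= versus >) is an 
assumption on my part.

```python
def token_is_used(token_score, min_dev):
    """True if the token is polarized enough to count toward classification."""
    return abs(token_score - 0.5) >= min_dev


def classify(score, ham_cutoff, spam_cutoff):
    """Map a message score to spam / ham / unsure (boundary handling assumed)."""
    if score >= spam_cutoff:
        return "spam"
    if score <= ham_cutoff:
        return "ham"
    return "unsure"


robx, min_dev, ham_cutoff = 0.46, 0.2, 0.1

# A never-seen token scores robx, which lies inside 0.5 +/- min_dev,
# so it is ignored.
unseen_token_used = token_is_used(robx, min_dev)

# An email with zero known tokens is scored equal to robx.
with_cutoff_above_robx = classify(robx, ham_cutoff, spam_cutoff=0.465)
with_cutoff_below_robx = classify(robx, ham_cutoff, spam_cutoff=0.45)

print(unseen_token_used, with_cutoff_above_robx, with_cutoff_below_robx)
```

With spam_cutoff above robx the empty-knowledge email comes out unsure; 
drop spam_cutoff below robx and the same email is called spam.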

Here are my values:
robx=0.46, robs=0.2, min_dev=0.2, spam_cutoff=0.465, ham_cutoff=0.1, 
thresh_update=0.01

Recently, I've been getting zero false positives, about 1 false negative 
every 2-3 days, and about 5 unsures per day.  I could probably lower my 
spam_cutoff some more in order to reduce the unsures since I never see ham 
in there.  Correctly filtered emails were about 100-200 per day until I 
installed some DNSBLs, and now they're down to about 30-40 per day which 
makes it easier to occasionally scan for false positives.

Tom

P.S. In fact, I've just decided to lower my spam_cutoff:
robx=0.41, robs=0.2, min_dev=0.2, spam_cutoff=0.42, ham_cutoff=0.1, 
thresh_update=0.01

I lowered my robx so that it stays below spam_cutoff while remaining above 
0.5-min_dev.
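
If you want to sanity-check a set of values before changing your config, 
something like the following Python would do.  The helper is hypothetical 
(bogofilter does not ship it); it only encodes the ordering argued for in 
this message, plus the assumption that ham_cutoff sits below spam_cutoff.

```python
def check_settings(robx, min_dev, spam_cutoff, ham_cutoff):
    """Return a list of complaints; an empty list means the ordering holds."""
    problems = []
    if not (0.5 - min_dev < robx < 0.5 + min_dev):
        problems.append("robx is outside 0.5 +/- min_dev")
    if not robx < spam_cutoff:
        problems.append("spam_cutoff is not above robx")
    if not ham_cutoff < spam_cutoff:
        problems.append("ham_cutoff is not below spam_cutoff")
    return problems


# The two sets of values quoted in this message.
original = check_settings(robx=0.46, min_dev=0.2, spam_cutoff=0.465, ham_cutoff=0.1)
lowered = check_settings(robx=0.41, min_dev=0.2, spam_cutoff=0.42, ham_cutoff=0.1)

# Lowering spam_cutoff without also lowering robx breaks the ordering.
careless = check_settings(robx=0.46, min_dev=0.2, spam_cutoff=0.42, ham_cutoff=0.1)

print(original, lowered, careless)
```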



