spam cutoff less than neutral?

Tom Allison tallison at tacocat.net
Tue Feb 24 10:21:32 CET 2004


Tom Anderson wrote:
> On Tue, 2004-02-24 at 03:03, Boris 'pi' Piwinger wrote:
> 
>>>Cutoffs by definition ought to be at or outside of
>>>the min_dev range.  
>>
>>Not at all.
>>
>>
>>>Else, min_dev should really be changed to be
>>>consistent with your cutoff philosophy.
>>
>>It is absolutely consistent. I still don't get your point.
> 
> 
> If your min_dev is excluding all tokens between 0.35 and 0.75 as being
> unable to influence a ham/spam decision because they are too
> inconclusive, then it follows that a combined ranking within this range
> is also ambiguous.  If a message classification of 0.55 is definitely
> spam, then an individual token ranking 0.55 should also be indicative of
> spam.  This is an inconsistency to profess on the one hand that 0.55 is
> dubious, but on the other hand to declare it conclusive.
> 
> Tom
> 

I am not a statistician by any stretch, but what you suggest is not 
necessarily true.

You are taking one parameter (min_dev), which is applied to individual 
tokens to decide whether they should be included, and comparing it 
against another set of parameters (ham_cutoff, spam_cutoff), which are 
applied to a summarized measure of all those tokens.

To put it in statistically simplified terms that I can understand, you 
are saying something like this:
I want to take the average of 100 dice rolls and I will only consider it 
a success when the average is > 3.0 and a failure when it is < 2.0.

If you have a min_dev of 1.0, then all dice rolls of 3 or 4 will not be 
counted, only 1, 2, 5, 6.  (I am assuming here that min_dev is a fixed 
deviation around the midpoint: 0.5 for bogofilter, 3.5 for a die.)  You 
can still have an average score between 3 and 4 despite none of the 
counted rolls being in that range.
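
Here is a rough sketch of the dice example in Python, just to make the 
arithmetic concrete. The function names, the 3.5 midpoint and the sample 
rolls are my own assumptions for illustration; this is not how bogofilter 
actually combines token scores.

```python
def kept_rolls(rolls, midpoint=3.5, min_dev=1.0):
    """Keep only rolls at least min_dev away from the midpoint.

    With midpoint=3.5 and min_dev=1.0 this drops 3 and 4 and keeps
    1, 2, 5 and 6, mirroring the dice analogy (an assumption, not
    bogofilter's real scoring).
    """
    return [r for r in rolls if abs(r - midpoint) >= min_dev]


def filtered_average(rolls, midpoint=3.5, min_dev=1.0):
    """Average of the kept rolls, or None if every roll was excluded."""
    kept = kept_rolls(rolls, midpoint, min_dev)
    if not kept:
        return None
    return sum(kept) / len(kept)


# Hypothetical sample: the 3s and 4s are ignored, yet the average of
# what remains lands squarely inside the excluded 3..4 band.
sample = [1, 2, 5, 6, 3, 4, 3, 4]
print(kept_rolls(sample))        # [1, 2, 5, 6]
print(filtered_average(sample))  # 3.5
```

So a per-roll exclusion band says nothing about where the combined 
figure may fall, which is why a cutoff inside that band is not 
automatically inconsistent.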




