spam cutoff less than neutral?
tallison at tacocat.net
Tue Feb 24 04:21:32 EST 2004
Tom Anderson wrote:
> On Tue, 2004-02-24 at 03:03, Boris 'pi' Piwinger wrote:
>>>Cutoffs by definition ought to be at or outside of
>>>the min_dev range.
>>Not at all.
>>>Else, min_dev should really be changed to be
>>>consistent with your cutoff philosophy.
>>It is absolutely consistent. I still don't get your point.
> If your min_dev is excluding all tokens between 0.35 and 0.75 as being
> unable to influence a ham/spam decision because they are too
> inconclusive, then it follows that a combined ranking within this range
> is also ambiguous. If a message classification of 0.55 is definitely
> spam, then an individual token ranking 0.55 should also be indicative of
> spam. This is an inconsistency to profess on the one hand that 0.55 is
> dubious, but on the other hand to declare it conclusive.
I am not a statistician by any stretch, but what you suggested is not
a valid comparison.
You are taking one set of parameters (min_dev) which are applied against
individual tokens to measure if they should be included and comparing
that against another set of parameters (ham_cutoff, spam_cutoff) which
are a summarized measure of all those tokens.
To put it in statistically simplified terms that I can understand, you are
saying something like this:
I want to take the average of 100 dice rolls and I will only consider a
success when the average is > 3.0 and failure when < 2.0.
If you have a min_dev of 1.0, then dice rolls of 3 or 4 will not be
counted, only 1, 2, 5, 6. (I am assuming here that min_dev is a fixed
deviation around the midpoint: 0.5 for bogofilter, 3.5 for a die.) You
can still have an average score between 3 and 4 despite none of the
individual scores you counted being in that range.
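
The dice analogy can be checked with a few lines of Python. This is only
a sketch of the analogy, not of how bogofilter combines token scores:
the function names, the midpoint of 3.5, and the use of a plain average
are all assumptions made for illustration.

```python
import random

def filter_rolls(rolls, center=3.5, min_dev=1.0):
    """Keep only rolls at least min_dev away from center.

    With the defaults this drops 3 and 4 and keeps 1, 2, 5, 6.
    """
    return [r for r in rolls if abs(r - center) >= min_dev]

def filtered_average(rolls, center=3.5, min_dev=1.0):
    """Average of the rolls that survive the min_dev filter (None if none do)."""
    kept = filter_rolls(rolls, center, min_dev)
    return sum(kept) / len(kept) if kept else None

# Small hand-checkable sample: the 3 and the 4 are discarded.
sample = [1, 2, 5, 6, 3, 4, 5, 2]
kept = filter_rolls(sample)
avg = filtered_average(sample)
print(kept, avg)  # [1, 2, 5, 6, 5, 2] 3.5

# 100 simulated rolls; the seed is arbitrary, chosen only for repeatability.
rng = random.Random(2004)
rolls = [rng.randint(1, 6) for _ in range(100)]
print(filtered_average(rolls))
```

In the small sample no counted roll is a 3 or a 4, yet the average is
3.5, which sits squarely inside the excluded range. The per-roll filter
and the threshold on the average are measuring different things.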