min_dev vs spam_cutoff [was: spam cutoff less than neutral? ]

Tue Feb 24 13:43:17 CET 2004

On Tue, 24 Feb 2004 04:21:32 -0500
Tom Allison wrote:

...[snip]...

> I am not a statistician by any stretch, but what you suggested is not 
> necessarily true.
> 
> You are taking one set of parameters (min_dev) which are applied
> against individual tokens to measure if they should be included and
> comparing that against another set of parameters (ham_cutoff,
> spam_cutoff) which are a summarized measure of all those tokens.
> 
> To put it in statistically simplified terms that I can understand, you are 
> saying something like this:
> I want to take the average of 100 dice rolls and I will only consider
> a success when the average is > 3.0 and failure when < 2.0.
> 
> If you have a min_dev of 1.0, then all dice rolls of 3 or 4 will not
> be counted, only 1, 2, 5, 6.  (I am assuming here that min_dev is a
> fixed variance around 0.5) You can still have an average score between
> 3 and 4 despite your individual scores not being in that range.

My turn to chime in!

I think Tom Allison is on track here.  There is little or no relation
between min_dev and spam_cutoff.  min_dev says to ignore neutrally
scored tokens (those whose score is within min_dev of 0.5).  spam_cutoff
says to label as spam those messages with lots of high scoring tokens.
One parameter works on individual tokens, the other on whole messages.

The robx parameter _does_ relate (somewhat) to spam_cutoff.  If you
recall, it's the value given to unknown words and has a default of
0.415.  _If_ you ever got a message made up entirely of never-before-seen
tokens, you'd expect its score to be 0.415.  Since we prefer false
negatives to false positives, we want robx to be less than spam_cutoff.
If robx > spam_cutoff, then a message of unknown words would be
classified as spam.
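
A small Python sketch of that point, using Robinson's token formula
f(w) = (s*x + n*p(w)) / (s + n), where x is robx, n is how often the
token has been seen, and s is the strength given to the prior.  The
values of s (ROBS) and SPAM_CUTOFF below are assumptions chosen for
illustration, and this is not bogofilter's own code:

```python
# Illustrative sketch only; not bogofilter's actual implementation.
ROBX = 0.415        # default score for unknown words, as stated above
ROBS = 0.01         # assumed prior strength, for illustration only
SPAM_CUTOFF = 0.95  # assumed cutoff, for illustration only

def token_score(p_w, n, robx=ROBX, robs=ROBS):
    """Robinson's f(w) = (s*x + n*p(w)) / (s + n)."""
    return (robs * robx + n * p_w) / (robs + n)

# A never-before-seen token has n == 0, so its score collapses to robx.
unseen = token_score(p_w=0.0, n=0)
print(unseen)
print(unseen >= SPAM_CUTOFF)  # robx below the cutoff: not called spam

# If robx were set above the cutoff, the unknown token alone would qualify.
print(token_score(p_w=0.0, n=0, robx=0.97) >= SPAM_CUTOFF)
```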

That's point 1.  Here's point 2.

The last step of the Robinson-Fisher algorithm, which is what bogofilter
uses, is a reverse chi-square test.  This test answers the question
"for the number of tokens scored, and the spamicity seen, what is the
likelihood that this message is spam?"  Given the nature of the
chi-square test, it skews a linear 0 to 1 result towards the end points
(and away from the center).  This means that there is _no_ simple
(linear) relation between min_dev, robx, and spam_cutoff.
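
For anyone who wants to see the skew, here is a rough Python sketch of
Fisher-style combining as I understand it from Robinson's description.
It is not taken from the bogofilter source; the function names are made
up, and the token scores are invented.  It feeds in n identical token
scores and prints the combined score for n = 1, 10 and 50:

```python
import math

def chi2q(x2, dof):
    """Probability that a chi-square variable with even `dof` exceeds x2."""
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)  # guard against rounding slightly above 1

def fisher_score(token_scores):
    """Combine per-token scores (each strictly between 0 and 1) into one."""
    n = len(token_scores)
    # Close to 1 when the tokens look spammy, close to 0 when they look hammy.
    spam_side = chi2q(-2.0 * sum(math.log(p) for p in token_scores), 2 * n)
    # Close to 1 when the tokens look hammy, close to 0 when they look spammy.
    ham_side = chi2q(-2.0 * sum(math.log(1.0 - p) for p in token_scores), 2 * n)
    return (1.0 + spam_side - ham_side) / 2.0

for p in (0.415, 0.5, 0.6, 0.7):
    print(p, [round(fisher_score([p] * n), 3) for n in (1, 10, 50)])
```

With a single token the combined score equals the token score, and
tokens at exactly 0.5 stay at 0.5, but as more tokens lean the same way
the result is pushed towards 0 or 1.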

And that's what I know :-)  With luck, Greg will contribute a more
technical explanation.

Cheers!

David