min_dev vs spam_cutoff [was: spam cutoff less than neutral? ]

Tom Anderson tanderso at oac-design.com
Tue Feb 24 14:57:05 CET 2004


On Tue, 2004-02-24 at 07:43, David Relson wrote:
> I think Tom Allison is on track here.  There is little or no relation
> between min_dev and spam_cutoff.  min_dev says ignore neutrally scored
> tokens.  spam_cutoff says to label as spam those messages with lots of
> high scoring tokens.  These parameters address different realms.

Thanks guys, statistics was never my strong suit.  The way I was looking
at it: if a message consisted of only 10 tokens scoring 0.6 each, and
your spam cutoff is 0.5, then clearly the message should be spam if
these tokens are counted.  However, lacking any strongly scoring tokens,
I can understand why this message would still rightly be considered
unsure, and therefore why these tokens should not be counted.  On the
other hand, six tokens at 1.0 against four at 0.0 may be rather spammy.
Even 5 vs 5 or 4 vs 6 may still be spammy.
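
To make the two cases above concrete, here is a minimal Python sketch of
what min_dev does to them.  It is not bogofilter's code: the function
name filter_by_min_dev and the min_dev value of 0.2 are assumptions
chosen only for illustration.

```python
def filter_by_min_dev(scores, min_dev):
    """Keep only token scores at least min_dev away from the neutral 0.5."""
    return [f for f in scores if abs(f - 0.5) >= min_dev]

MIN_DEV = 0.2  # hypothetical setting, chosen only for this example

weak = [0.6] * 10               # ten mildly spammy tokens
mixed = [1.0] * 6 + [0.0] * 4   # six strong spam tokens, four strong ham tokens

kept_weak = filter_by_min_dev(weak, MIN_DEV)
kept_mixed = filter_by_min_dev(mixed, MIN_DEV)

# How many tokens survive the filter in each case.
print(len(kept_weak), len(kept_mixed))
```

With this setting none of the 0.6 tokens are counted at all, while every
token in the mixed message is, whichever way the message finally scores.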

> The robx parameter _does_ relate (somewhat) to spam_cutoff.  If you
> recall, it's the value given to unknown words and has a default of
> 0.415.  _If_ you ever got a message with never before seen tokens, you'd
> expect its score to be 0.415.  Since we prefer false negatives to false
> positives, we want robx to be less than spam_cutoff.  If robx >
> spam_cutoff, then the message of unknown words would be spam.
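
A rough Python sketch of how robx enters a token's score, using
Robinson's f(w) = (s*x + n*p(w)) / (s + n).  The strength value ROBS,
the spam_cutoff of 0.5 and the way p(w) is computed here (plain counts,
ignoring differing corpus sizes) are assumptions for illustration, not
bogofilter's actual defaults or code.

```python
ROBX = 0.415        # value for unknown words, as quoted above
ROBS = 0.01         # assumed weight given to robx; illustrative only
SPAM_CUTOFF = 0.5   # hypothetical cutoff for this example

def token_score(spam_count, ham_count, robx=ROBX, robs=ROBS):
    """Robinson's f(w): blend the observed ratio p(w) with the prior robx."""
    n = spam_count + ham_count
    # Simplification: p(w) from raw counts, assuming equal-sized corpora.
    p = spam_count / n if n else 0.0
    return (robs * robx + n * p) / (robs + n)

unknown = token_score(0, 0)    # never-seen token: falls back to robx
seen = token_score(9, 1)       # token seen 9 times in spam, once in ham

print(unknown, seen)
print("unknown token is below the cutoff:", unknown < SPAM_CUTOFF)
```

A token with no history gets robx, and the more often a token has been
seen, the less robx matters to its score.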

So I'll continue reducing my spam_cutoff until my spam unsures are
reduced, but not beyond my robx.  Thanks.

> The last step of the Robinson-Fisher algorithm, which is what bogofilter
> uses, is a reverse chi-square test.  This test answers the question
> "for the number of tokens scored, and the spamicity seen, what is the
> likelihood that this message is spam?"  Given the nature of the
> chi-square test, it skews a linear 0 to 1 result towards the end points
> (and away from the center).  This means that there is _no_ simple
> (linear) relation between min_dev, robx, and spam_cutoff.
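
The skewing described above can be seen with a small Python sketch of
the inverse chi-square combining step.  This follows the commonly
published form of the Robinson-Fisher calculation; the names chi2q and
fisher_score are made up here, and bogofilter's own implementation may
differ in detail (for example in how it handles scores of exactly 0.0
or 1.0, which this sketch does not accept).

```python
import math

def chi2q(x2, dof):
    """Upper-tail chi-square probability; valid for even dof only."""
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def fisher_score(scores):
    """Combine token scores (each strictly between 0 and 1) into one value."""
    n = len(scores)
    p = chi2q(-2.0 * sum(math.log(1.0 - f) for f in scores), 2 * n)
    q = chi2q(-2.0 * sum(math.log(f) for f in scores), 2 * n)
    return (1.0 + q - p) / 2.0

for f in (0.5, 0.6, 0.9):
    print(f, fisher_score([f] * 10))
```

Ten tokens at 0.5 combine to exactly 0.5, ten at 0.6 land only
moderately above 0.6, and ten at 0.9 come out very close to 1.0, so the
combined score is not a simple average of the token scores.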

Seeing as I'm getting lots of unsures near 0.5, is there any way to pass
in options which will cause this skewing to favor 1.0 a little more?

Tom


> And that's what I know :-)  With luck, Greg will contribute a more
> technical explanation.
> 
> Cheers!
> 
> David