New version

Tue Mar 16 18:55:54 CET 2004

On 20040316 (Tue) at 18:31:08 +0100, Boris 'pi' Piwinger wrote:
> Greg Louis wrote:
> 
> >>  I'd much rather get it as unsure, and at least have a chance to
> >> register it as spam once.  Therefore, the robx ought to be less than the
> >> spam_cutoff
> > 
> > Sorry but this betrays a fundamental misconception on your part.  The
> > values of x and the spam cutoff are not to be compared in that way,
> > because they are not linearly related _at_all_.  Remember, the score
> > that the spam cutoff is compared against is calculated by Fisher's
> > method of combining probabilities, not the old Robinson geometric-mean
> > thing; a message consisting of ten tokens with fw of 0.532 (smaller
> > than the spam_cutoff, although not much so) would still score 0.5637.
> > The value of robx is supposed to be a guess at how likely it is that an
> > unknown token is to be found in spam.  In my message corpus, that
> > likelihood really is around 0.6, so that's what the prior should be.
> 
> To save Tom here;-) If you have a message with no
> significant token whatsoever, then they are directly compared.
> 
That's only true if every token's fw is within min_dev of 0.5.  If you
have any unknowns and x is outside 0.5 +/- min_dev, it's not true.  But
yes, if you want an even worse straw man than Tom's all-unknowns
message ;) an all-0.5 message will be scored at robx and (in my case)
classed as spam.
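
For anyone who wants to check the arithmetic, here is a minimal Python
sketch of the Fisher combination as described above. It is not
bogofilter's actual code: the function names, the default robx of 0.6
(taken from the figure mentioned for my corpus) and the exact form of
the min_dev test are assumptions made for illustration.

```python
import math


def chi2_sf_even(x, df):
    """P(X > x) for a chi-square variable with an even number of
    degrees of freedom, using the closed-form series."""
    m = x / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def fisher_score(fws, robx=0.6, min_dev=0.0):
    """Combine per-token fw values into one score (illustrative only).

    Tokens whose fw lies within min_dev of 0.5 are ignored. If nothing
    is left, the score falls back to robx, as described in the message.
    """
    used = [f for f in fws if abs(f - 0.5) >= min_dev]
    if not used:
        return robx
    n = len(used)
    h = chi2_sf_even(-2.0 * sum(math.log(f) for f in used), 2 * n)
    s = chi2_sf_even(-2.0 * sum(math.log(1.0 - f) for f in used), 2 * n)
    return (1.0 + h - s) / 2.0


if __name__ == "__main__":
    # Ten tokens, each with fw = 0.532
    print(fisher_score([0.532] * 10))
    # All tokens at 0.5 with a nonzero min_dev: falls back to robx
    print(fisher_score([0.5] * 10, robx=0.6, min_dev=0.1))
```

The first call should land close to the 0.5637 quoted above, and the
second returns robx unchanged, which is the all-0.5 case. With a single
token the combined score reduces to that token's fw, so it is only as
tokens accumulate that the score moves away from the individual values.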

-- 
| G r e g  L o u i s         | gpg public key: 0x400B1AA86D9E3E64 |
|  http://www.bgl.nu/~glouis |   (on my website or any keyserver) |
|  http://wecanstopspam.org in signatures helps fight junk email. |