New version
Greg Louis
glouis at dynamicro.on.ca
Tue Mar 16 18:47:51 CET 2004
On 20040316 (Tue) at 10:55:12 -0500, Greg Louis wrote:
(in reply to a message from Tom Anderson)
> values of x and the spam cutoff ... are not linearly related
> _at_all_. Remember, the score that the spam cutoff is compared
> against is calculated by Fisher's method of combining probabilities,
> not the old Robinson geometric-mean thing; a message consisting of
> ten tokens with fw of 0.532 (smaller than the spam_cutoff, although
> not much so) would still score 0.5637. The value of robx is supposed
> to be a guess at how likely an unknown token is to be found in
> spam. In my message corpus, that likelihood really is around 0.6,
> so that's what the prior should be.
In fact, suppose I received a nonspam consisting of 100 unknown tokens.
My robx is set to 0.610612, so that message would score 0.6463; my
spam_cutoff is 0.5322, so it would be classed as spam. This was your
concern, Tom, and so far it seems a valid one. But suppose there
were just six additional tokens in the message with fw values of
0.001; now the message would score 0.5310 and be classed as unsure.
I deliver unsures, so I'd get the message normally. I train on
unsures, so those hundred new tokens would end up in the training db
with counts of 0 and 1.
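The scores quoted above can be reproduced with a small sketch of Fisher's
method as bogofilter applies it: combine -2*sum(ln f(w)) and
-2*sum(ln (1 - f(w))) through the inverse chi-square with 2n degrees of
freedom, then take (1 + S - H) / 2. The function names here are mine, not
bogofilter's; this is a pure-Python illustration, not the actual
implementation.

```python
import math

def inv_chi2(chi2, df):
    # Survival function of a chi-square variate with an even number of
    # degrees of freedom df = 2*m, evaluated at chi2 (closed form:
    # exp(-x) * sum of the first m terms of the series for exp(x),
    # where x = chi2/2).
    m = df // 2
    x = chi2 / 2.0
    term = math.exp(-x)
    total = term
    for i in range(1, m):
        term *= x / i
        total += term
    return min(total, 1.0)

def fisher_score(probs):
    # Combine per-token spam probabilities f(w): S is high when the
    # tokens look spammy, H is high when they look hammy, and the
    # message score is the symmetric combination (1 + S - H) / 2.
    n = len(probs)
    s = inv_chi2(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    h = inv_chi2(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    return (1.0 + s - h) / 2.0

# Ten tokens at fw = 0.532 score about 0.5637, well above the cutoff
# even though each fw is barely above it:
print(fisher_score([0.532] * 10))
# A hundred unknown tokens at robx = 0.610612 score about 0.6463, but
# six hammy tokens at fw = 0.001 pull that down to about 0.5310:
print(fisher_score([0.610612] * 100))
print(fisher_score([0.610612] * 100 + [0.001] * 6))
```

This shows why x and the cutoff are not linearly related: the chi-square
tails respond very steeply to a handful of extreme fw values, so a few
clearly nonspammy tokens outweigh many near-neutral ones.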
My point here is that it would be _really_ unlikely for me to receive a
nonspam that didn't have some clearly nonspammy keywords in it, given
the size of my training db, and it doesn't take many of those to swing
the balance.
--
| G r e g L o u i s | gpg public key: 0x400B1AA86D9E3E64 |
| http://www.bgl.nu/~glouis | (on my website or any keyserver) |
| http://wecanstopspam.org in signatures helps fight junk email. |
More information about the Bogofilter mailing list