New version
Greg Louis
glouis at dynamicro.on.ca
Tue Mar 16 18:47:51 CET 2004
On 20040316 (Tue) at 10:55:12 -0500, Greg Louis wrote:
(in reply to a message from Tom Anderson)
> values of x and the spam cutoff ... are not linearly related
> _at_all_. Remember, the score that the spam cutoff is compared
> against is calculated by Fisher's method of combining probabilities,
> not the old Robinson geometric-mean thing; a message consisting of
> ten tokens with fw of 0.532 (smaller than the spam_cutoff, although
> not much so) would still score 0.5637. The value of robx is supposed
> to be a guess at how likely an unknown token is to be found in
> spam. In my message corpus, that likelihood really is around 0.6,
> so that's what the prior should be.
In fact, suppose I received a nonspam consisting of 100 unknown tokens.
My robx is set to 0.610612, so that message would score 0.6463; my
spam_cutoff is 0.5322, so it would be classed as spam. This was your
concern, Tom, and so far it seems a valid one. But suppose there
were just six additional tokens in the message with fw values of
0.001; now the message would score 0.5310 and be classed as unsure.
I deliver unsures, so I'd get the message normally. I train on
unsures, so those hundred new tokens would end up in the training db
with counts of 0 and 1.
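The scores quoted above can be reproduced with a small sketch of Fisher's
method as bogofilter applies it: combine -2*sum(ln f(w)) and
-2*sum(ln (1 - f(w))) through the inverse chi-square with 2n degrees of
freedom, then take (1 + S - H) / 2. The function names here are mine, not
bogofilter's; this is a pure-Python illustration, not the actual
implementation.

```python
import math

def inv_chi2(chi2, df):
    # Survival function of a chi-square variate with an even number of
    # degrees of freedom df = 2*m, evaluated at chi2 (closed form:
    # exp(-x) * sum of the first m terms of the series for exp(x),
    # where x = chi2/2).
    m = df // 2
    x = chi2 / 2.0
    term = math.exp(-x)
    total = term
    for i in range(1, m):
        term *= x / i
        total += term
    return min(total, 1.0)

def fisher_score(probs):
    # Combine per-token spam probabilities f(w): S is high when the
    # tokens look spammy, H is high when they look hammy, and the
    # message score is the symmetric combination (1 + S - H) / 2.
    n = len(probs)
    s = inv_chi2(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    h = inv_chi2(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    return (1.0 + s - h) / 2.0

# Ten tokens at fw = 0.532 score about 0.5637, well above the cutoff
# even though each fw is barely above it:
print(fisher_score([0.532] * 10))
# A hundred unknown tokens at robx = 0.610612 score about 0.6463, but
# six hammy tokens at fw = 0.001 pull that down to about 0.5310:
print(fisher_score([0.610612] * 100))
print(fisher_score([0.610612] * 100 + [0.001] * 6))
```

This shows why x and the cutoff are not linearly related: the chi-square
tails respond very steeply to a handful of extreme fw values, so a few
clearly nonspammy tokens outweigh many near-neutral ones.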
My point here is that it would be _really_ unlikely for me to receive a
nonspam that didn't have some clearly nonspammy keywords in it, given
the size of my training db, and it doesn't take many of those to swing
the balance.
--
| G r e g L o u i s | gpg public key: 0x400B1AA86D9E3E64 |
| http://www.bgl.nu/~glouis | (on my website or any keyserver) |
| http://wecanstopspam.org in signatures helps fight junk email. |
More information about the Bogofilter mailing list