Exclusion Intervals

Wed Jun 30 23:12:00 CEST 2004

On Wed, 30 Jun 2004 14:35:48 -0400
Tom Anderson wrote:

...[snip]...

> That would be nice.  But now I'm more concerned about why the
> probabilities don't really reflect what's in the database.  If token X
> occurs in 100 spams and 1 ham, then X is spammy no matter how many
> spams I have in my.MSG_COUNT, even if the ratio is 1000:1.

Hi Tom,

If I have 1000 spam and 10 ham and _every_ message has "relson" in it,
does that mean a message with "relson" is 100 times more likely to be
spam than ham?  I don't think so.

Computing "token_count / message_count" adjusts for the different
number of spam and ham messages in the database.  After that adjustment,
we have the probability that the presence of "whatever" indicates spam
and the probability that it indicates ham.  It's pretty simple.
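
To make the adjustment concrete, here is a minimal Python sketch.  The
function names are made up for illustration, the message totals in the
"X" case are assumed (picked only to give a 1000:1 ratio), and the
simple spam_freq / (spam_freq + ham_freq) combination is one common way
to turn the two adjusted frequencies into a single score.  It has no
smoothing for rare tokens and is not meant to reproduce what bogofilter
actually computes.

```python
def class_frequency(token_count, message_count):
    """Fraction of messages in one class (spam or ham) containing the token."""
    if message_count == 0:
        return 0.0
    return token_count / message_count


def token_spam_probability(spam_hits, ham_hits, spam_msgs, ham_msgs):
    """Spam score for a token after adjusting for the size of each corpus.

    Hypothetical helper: no smoothing is applied for rarely seen tokens.
    """
    spam_freq = class_frequency(spam_hits, spam_msgs)
    ham_freq = class_frequency(ham_hits, ham_msgs)
    if spam_freq + ham_freq == 0:
        return 0.5  # token never seen: no evidence either way
    return spam_freq / (spam_freq + ham_freq)


def raw_count_ratio(spam_hits, ham_hits):
    """Unadjusted share of a token's occurrences that were in spam."""
    total = spam_hits + ham_hits
    if total == 0:
        return 0.5
    return spam_hits / total


# "relson" appears in every one of 1000 spam and 10 ham messages.
relson_raw = raw_count_ratio(1000, 10)                        # about 0.99
relson_adjusted = token_spam_probability(1000, 10, 1000, 10)  # exactly 0.5

# Token X: 100 spam hits and 1 ham hit, with ASSUMED totals of
# 10000 spam and 10 ham messages (a 1000:1 ratio).
x_raw = raw_count_ratio(100, 1)                               # about 0.99
x_adjusted = token_spam_probability(100, 1, 10000, 10)        # about 0.09
```

With those assumed totals, X shows up in 1% of spam but 10% of ham, so
the adjusted score comes out hammy even though the raw counts look
overwhelmingly spammy.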

Just using your "X" and 100:1 messages is overly simplistic.  It
sounds like "if I have 10000 spam messages and 1000 ham messages in my
wordlist, then I can simply assign 10000/(10000+1000) as the probability
a new message is spam."  Blazingly fast, produces the right percentage
of spam and ham, and low storage requirements since there's no need for
a wordlist.

Alternatively, yesterday I received 1008 messages of which 596 were
spam, which means I can assign a 59.1% spam score to any given message.

As the above examples show, neither token counts alone nor message
counts alone are sufficient for a decent classification.  The ratio of
the two, token count over message count, gives the probability of a
token occurring in spam (and likewise in ham).

HTH,

David