Exclusion Intervals

Wed Jun 30 23:46:20 CEST 2004

From: "David Relson" <relson at osagesoftware.com>
> If I have 1000 spam and 10 ham and _every_ message has "relson" in it,
> does that mean a message with "relson" is 100 times more likely to be
> spam than ham?  I don't think so.

No, that's true, it should be neutral.  However, if you have 1000 spam and
10 ham, and 90 spam messages have "relson" in them, and only 1 ham message
has "relson" in it, does that mean "relson" is hammy?  It does according to
bogofilter, and maybe it makes sense statistically, but it is intuitively
false.  All it takes is a few legitimate discussions of otherwise spammy
topics to screw up all of the spam scores.  Previously I hadn't noticed such
an effect, but my wordlist must really be getting out of balance now...
.MSG_COUNT is showing 247017 spam and 8841 ham... that's about 28:1.  So for
any token seen in both ham and spam, each ham occurrence automatically
counts 28x as much as a spam occurrence?  This just doesn't make sense.
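
To make the 90-of-1000 versus 1-of-10 example concrete, here is a minimal
sketch of the "token_count / message_count" normalization discussed in this
thread.  It is not bogofilter's actual code: the function name is
hypothetical, and it leaves out any smoothing a real filter would apply.

```python
def normalized_spamicity(spam_hits, ham_hits, spam_msgs, ham_msgs):
    """Hypothetical helper: spam share of the per-message token frequencies.

    Assumption: no smoothing or prior is applied, unlike a real filter.
    """
    spam_freq = spam_hits / spam_msgs   # fraction of spam messages with the token
    ham_freq = ham_hits / ham_msgs      # fraction of ham messages with the token
    return spam_freq / (spam_freq + ham_freq)

# 1000 spam, 10 ham; "relson" appears in 90 spam and 1 ham.
score = normalized_spamicity(90, 1, 1000, 10)
print(round(score, 3))  # 0.474, i.e. slightly hammy

# Token present in every message: comes out neutral.
neutral = normalized_spamicity(1000, 10, 1000, 10)
print(neutral)  # 0.5
```

Under these assumptions a single ham occurrence (1 in 10) outweighs 90 spam
occurrences (90 in 1000), which is the effect being questioned here.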

Let's say I've seen "blah" 28 times in spams.  It's a new product that all
the spammers are raving about to their victims.  If I receive a single email
(say from this list) talking about how "blah" has become a hot spam topic,
I've now invalidated 28 registrations of that token, and if there's a reply
to that ham, now I need 56 registrations of spam to make up for it.
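
The "blah" arithmetic can be checked with a short sketch using the
.MSG_COUNT figures quoted above.  The helper name is hypothetical, and the
model assumes a plain normalized score with no smoothing, so "neutral" means
the per-message frequencies in spam and ham are equal.

```python
SPAM_MSGS, HAM_MSGS = 247017, 8841  # .MSG_COUNT figures quoted above

def spam_hits_to_offset(ham_hits, spam_msgs=SPAM_MSGS, ham_msgs=HAM_MSGS):
    """Hypothetical helper: spam registrations needed to balance ham_hits.

    Neutral means spam_hits / spam_msgs == ham_hits / ham_msgs.
    """
    return ham_hits * spam_msgs / ham_msgs

ratio = SPAM_MSGS / HAM_MSGS
one_ham = spam_hits_to_offset(1)
two_ham = spam_hits_to_offset(2)

print(round(ratio, 1))  # 27.9
print(round(one_ham))   # 28
print(round(two_ham))   # 56
```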

> Computing "token_count / message_count" adjusts for the different
> number of spam and ham messages in the database.  After that adjustment,
> we have the probabilities that the presence of "whatever" indicates spam
> and the probability that it represents ham.  It's pretty simple.
>
> Just using your "X" and 100:1 messages is overly simplistic.  It
> sounds like "if I have 10000 spam messages and 1000 ham messages in my
> wordlist, then I can simply assign 10000/(10000+1000) as the probability
> a new message is spam."  Blazingly fast, produces the right percentage
> of spam and ham, and low storage requirements since there's no need for
> a wordlist.

No, I'm saying that's the probability that a _token_ is spam, not the whole
message.  Then combine these token probabilities with the chi-square
calculation to get the full email score.  I can see how that would also
fail when every message contains a given token, though.
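
For comparison, here is a sketch of the two per-token estimates side by
side.  It assumes one reading of the raw-count idea, namely
spam_hits / (spam_hits + ham_hits) with no adjustment for message counts.
Both function names are hypothetical, neither includes smoothing, and the
chi-square combining step is not shown.

```python
def raw_spamicity(spam_hits, ham_hits):
    """Assumed reading of the raw-count proposal: ignore message counts."""
    return spam_hits / (spam_hits + ham_hits)

def normalized_spamicity(spam_hits, ham_hits, spam_msgs, ham_msgs):
    """token_count / message_count normalization, without smoothing."""
    spam_freq = spam_hits / spam_msgs
    ham_freq = ham_hits / ham_msgs
    return spam_freq / (spam_freq + ham_freq)

# Wordlist with 1000 spam and 10 ham messages.
# Case 1: every message contains the token.
raw_all = raw_spamicity(1000, 10)
norm_all = normalized_spamicity(1000, 10, 1000, 10)

# Case 2: the token appears in 90 spam and 1 ham.
raw_some = raw_spamicity(90, 1)
norm_some = normalized_spamicity(90, 1, 1000, 10)

print(round(raw_all, 3), round(norm_all, 3))    # 0.99 0.5
print(round(raw_some, 3), round(norm_some, 3))  # 0.989 0.474
```

Under these assumptions the raw estimate fails in case 1 (a token that
carries no information scores as strongly spammy), while the normalized
estimate gives the counterintuitive result in case 2.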

While I see your point, do you see mine?

Tom