flaw in spamicity calculation
Michael Elkins
me at sigpipe.org
Thu Sep 19 22:43:01 CEST 2002
Doug Beardsley wrote:
> I recommend an implementation of Gary Robinson's equation
> f(w) = (1 + nbad) / (2 + nbad + ngood).
I tried playing around with this some, but had poor results: it
classified practically all of my legitimate email as spam. Why?
Because I have many more messages in my spam corpus than in my
legitimate corpus, so any word that simply appears more times in the
spam corpus ends up being treated as a spam word. The current method,
which normalizes by corpus size, seems more robust to me.
Consider the following example:
Say I have a good mailbox with 100 messages, and the word "foo"
appears in 50 of them. Now say I have a spam corpus of 1000 messages,
and "foo" appears in 300 of those.
With the original method, I see that
p(foo|good) = 50/100 = 0.5 (it appears in half my good messages)
p(foo|bad) = 300/1000 = 0.3 (it appears in 30% of my spam messages)
So I calculate
p(foo) = p(foo|bad) / ( p(foo|good) + p(foo|bad) )
       = 0.3 / ( 0.5 + 0.3 )
       = 0.375
Which tells me that the word "foo" leans slightly toward non-spam,
which matches my intuition, given that it occurs in half my good mail
but only 30% of my spam.
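The original calculation can be sketched in a few lines of Python (a
minimal sketch; the function name is mine, and the counts are just the
example's numbers):

```python
def spamicity_original(nbad, total_bad, ngood, total_good):
    """Per-word spamicity from corpus-normalized frequencies."""
    p_bad = nbad / total_bad      # P(foo | spam) = 300/1000 = 0.3
    p_good = ngood / total_good   # P(foo | good) = 50/100   = 0.5
    return p_bad / (p_good + p_bad)

# "foo" appears in 50 of 100 good messages and 300 of 1000 spam messages
print(round(spamicity_original(300, 1000, 50, 100), 3))  # 0.375 -> leans non-spam
```

Because each count is divided by its own corpus size first, growing the
spam corpus alone does not inflate the score.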
Now consider the method proposed above:
p(foo) = (1 + nbad) / (2 + nbad + ngood)
       = (1 + 300) / (2 + 300 + 50)
       = 0.86
(Here nbad = 300 and ngood = 50 are the counts of messages containing
"foo" in each corpus.)
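Sketching the count-based formula the same way (again the function name
is mine; nbad = 300 and ngood = 50 are the per-word message counts from
the example above):

```python
def spamicity_counts(nbad, ngood):
    """Proposed formula: f(w) = (1 + nbad) / (2 + nbad + ngood).
    Works on raw per-word message counts, with no corpus-size normalization."""
    return (1 + nbad) / (2 + nbad + ngood)

# "foo" appears in 300 spam messages and 50 good messages
print(round(spamicity_counts(300, 50), 2))  # 0.86 -> scored strongly spammy
```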
The author of the article this algorithm comes from claimed that it
reduces to the same as Paul's, but it clearly doesn't. Anyone who has a
lot more spam will end up having innocent words classified as spam
simply because there are more messages!
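The size bias is easy to demonstrate with the count-based formula: take
a word that is equally frequent (say 50%) in both corpora and simply
make the spam corpus ten times larger (hypothetical numbers, chosen
only for illustration):

```python
def f(nbad, ngood):
    """(1 + nbad) / (2 + nbad + ngood) on raw per-word message counts."""
    return (1 + nbad) / (2 + nbad + ngood)

# The word appears in 50% of messages in BOTH corpora:
balanced = f(50, 50)   # 100-message spam corpus vs 100-message good corpus
skewed = f(500, 50)    # 1000-message spam corpus vs 100-message good corpus
print(round(balanced, 2), round(skewed, 2))  # 0.5 0.91
```

The word carries no information either way, yet merely enlarging the
spam corpus drives its score from 0.5 to about 0.91.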
Or did I completely misunderstand the point of this method?
More information about the Bogofilter mailing list