flaw in spamicity calculation
Michael Elkins
me at sigpipe.org
Thu Sep 19 22:43:01 CEST 2002
Doug Beardsley wrote:
> I recommend an implementation of Gary Robinson's equation
> f(w) = (1 + nbad) / (2 + nbad + ngood).
I tried playing around with this some, but had poor results: it
classified practically all of my legitimate email as spam. Why?
Because I have many more messages in my spam corpus than in my
legitimate corpus, so any word that simply appears more times in the
spam corpus ends up being treated as a spam word. The current method,
which normalizes by corpus size, seems more robust to me.
Consider the following example:
Say I have a good mailbox with 100 messages, and the word "foo"
appears in 50 of them. Now say I have a spam corpus of 1000 messages,
and "foo" appears in 300 of those.
With the original method, I see that
p(foo|good) = 50/100 = 0.5 (it appears in half my good messages)
p(foo|bad) = 300/1000 = 0.3 (it appears in 30% of my spam messages)
So I calculate
p(foo) = p(foo|bad) / ( p(foo|good) + p(foo|bad) )
       = 0.3 / ( 0.5 + 0.3 )
       = 0.375
Which tells me that the word "foo" leans slightly toward non-spam,
which matches my intuition, given that it occurs in half my good mail
but only 30% of my spam.
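The original calculation can be sketched in a few lines of Python (a
minimal sketch; the function name is mine, and the counts are just the
example's numbers):

```python
def spamicity_original(nbad, total_bad, ngood, total_good):
    """Per-word spamicity from corpus-normalized frequencies."""
    p_bad = nbad / total_bad      # P(foo | spam) = 300/1000 = 0.3
    p_good = ngood / total_good   # P(foo | good) = 50/100   = 0.5
    return p_bad / (p_good + p_bad)

# "foo" appears in 50 of 100 good messages and 300 of 1000 spam messages
print(round(spamicity_original(300, 1000, 50, 100), 3))  # 0.375 -> leans non-spam
```

Because each count is divided by its own corpus size first, growing the
spam corpus alone does not inflate the score.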
Now consider the method proposed above:
p(foo) = (1 + nbad) / (2 + nbad + ngood)
       = (1 + 300) / (2 + 300 + 50)
       = 0.86
(Here nbad = 300 and ngood = 50 are the counts of messages containing
"foo" in each corpus.)
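Sketching the count-based formula the same way (again the function name
is mine; nbad = 300 and ngood = 50 are the per-word message counts from
the example above):

```python
def spamicity_counts(nbad, ngood):
    """Proposed formula: f(w) = (1 + nbad) / (2 + nbad + ngood).
    Works on raw per-word message counts, with no corpus-size normalization."""
    return (1 + nbad) / (2 + nbad + ngood)

# "foo" appears in 300 spam messages and 50 good messages
print(round(spamicity_counts(300, 50), 2))  # 0.86 -> scored strongly spammy
```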
The author of the article this algorithm comes from claimed that it
reduces to the same as Paul's, but it clearly doesn't. Anyone who has a
lot more spam will end up having innocent words classified as spam
simply because there are more messages!
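The size bias is easy to demonstrate with the count-based formula: take
a word that is equally frequent (say 50%) in both corpora and simply
make the spam corpus ten times larger (hypothetical numbers, chosen
only for illustration):

```python
def f(nbad, ngood):
    """(1 + nbad) / (2 + nbad + ngood) on raw per-word message counts."""
    return (1 + nbad) / (2 + nbad + ngood)

# The word appears in 50% of messages in BOTH corpora:
balanced = f(50, 50)   # 100-message spam corpus vs 100-message good corpus
skewed = f(500, 50)    # 1000-message spam corpus vs 100-message good corpus
print(round(balanced, 2), round(skewed, 2))  # 0.5 0.91
```

The word carries no information either way, yet merely enlarging the
spam corpus drives its score from 0.5 to about 0.91.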
Or did I completely misunderstand the point of this method?
More information about the Bogofilter mailing list