flaw in spamicity calculation

Fri Sep 20 15:24:48 CEST 2002

Hmmm, I hadn't thought about this.  I think you do understand the point
of this method and I agree that this is a problem.  The weighted idea
mentioned in another post sounds promising.  Now if I could just find
time to test some of these things...

Doug

On Thu, Sep 19, 2002 at 01:43:01PM -0700, Michael Elkins wrote:
> Doug Beardsley wrote:
> > I recommend that an implementation of Gary Robinson's equation
> > f(w) = (a + s) / ((a*b) + c).
> 
> I tried playing around with this some, but had poor results.  It
> classified practically all of my legitimate email as spam.  Why?
> Because I have many more messages in my spam corpus than my legitimate
> corpus.  With this equation, any word that appears more times in the
> spam corpus is considered a spam word, unlike the current method, which
> I believe is more robust.
> 
> Consider the following example:
> 
> Say I have a good mailbox with 100 messages.  The word "foo" appears in
> 50 of those messages.  Now say I have a spam corpus of 1000 messages, and
> the word "foo" appears in 300 of them.
> 
> With the original method, I see that
> 	p(foo|good)= 50/100= 0.5  (it appears in half my good messages)
> 	p(foo|bad)=300/1000= 0.3  (it appears in roughly 1/3 of spam msgs)
> 
> So I calculate
> 	p(foo)	= p(foo|bad) / ( p(foo|good) + p(foo|bad) )
> 		= 0.3 / ( 0.5 + 0.3 )
> 		= 0.375
> Which tends to tell me that the word "foo" is slightly more likely to be
> indicative of non-spam, which is what I agree with given that it occurs
> in half my good mail, and 1/3 of my bad mail.
> 
> Now consider the method proposed above:
> 	p(foo)	= (1 + nbad) / (2 + nbad + ngood)
> 		= (1 + 300) / (2 + 300 + 100)
> 		= 0.75
> The author of the article this algorithm comes from claimed that it
> reduces to the same as Paul's, but it clearly doesn't.  Anyone who has a
> lot more spam will end up having innocent words classified as spam
> simply because there are more messages!
> 
> Or did I completely misunderstand the point of this method?
> 
> ---------------------------------------------------------------------
> FAQ: http://bogofilter.sourceforge.net/bogofilter-faq.html
> To unsubscribe, e-mail: bogofilter-unsubscribe at aotto.com
> For summary digest subscription: bogofilter-digest-subscribe at aotto.com
> For more commands, e-mail: bogofilter-help at aotto.com
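
For anyone who wants to replay the "foo" example quoted above, here is a
minimal Python sketch of the two calculations.  The function names are
made up for illustration and are not taken from bogofilter; the second
function only mirrors the formula as it is written in the quoted message,
with the same values plugged in (nbad = 300, ngood = 100).

```python
def ratio_of_frequencies(bad_hits, bad_total, good_hits, good_total):
    # Hypothetical helper: normalise each count by the size of its own
    # corpus, then compare the two per-corpus frequencies.
    p_bad = bad_hits / bad_total
    p_good = good_hits / good_total
    return p_bad / (p_good + p_bad)


def raw_count_formula(nbad, ngood):
    # Hypothetical helper: (1 + nbad) / (2 + nbad + ngood), as written in
    # the quoted message.  No normalisation by corpus size happens here.
    return (1 + nbad) / (2 + nbad + ngood)


# "foo" appears in 50 of 100 good messages and in 300 of 1000 spam messages.
original = ratio_of_frequencies(300, 1000, 50, 100)

# Same substitution as in the quoted message: nbad = 300, ngood = 100.
proposed = raw_count_formula(300, 100)

print(f"ratio of frequencies: {original:.3f}")
print(f"raw-count formula:    {proposed:.3f}")
```

The first value lands below 0.5 and the second well above it, which is
the disagreement the quoted message is pointing at: only the first
calculation divides by the corpus sizes, so only the second is affected
by how much larger the spam corpus is.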
