troublesome false negative

Mon Nov 4 01:27:41 CET 2002

David Relson wrote:
> 
> Greetings,
> 
> The Graham and Robinson algorithms are clearly two different ways to 
> calculate spamicity.  I wish they were equally good on all messages, but 
> there seems to be a class of message where one says "spam" and the other 
> says "ham", even though to a human, the message is clearly and obviously 
> spam.
> 
> One of those "obviously spam" messages arrived and Robinson gave it a 
> 0.497731 (ham) rating.  I'm wondering what we can do to bogofilter so 
> that it'll catch messages like this.  The message's subject was "Joke of 
> the Day Nov 2" and the actual subject matter was a bit of joke/story 
> plus a lot of "lose weight,  increase the size of your ..." kind of 
> trash.  Graham gave the message a 0.99000 (spam) rating.
> 
> My goal in writing about this troublesome message is to find some ideas 
> for dealing effectively with it.
> 
> David
> 

How does Graham choose the most interesting words?

It seems to me that Robinson might be a slow learner because there are so 
many more words to work with.  Given that, the overall score will tend to 
drift towards 0.5 for both spam and ham messages, because most of the 
words will come in as indeterminate.
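
To make the drift towards 0.5 concrete, here is a minimal sketch of 
Robinson's geometric-mean combining as I understand it from his 
proposal.  The function name and the sample probabilities are made up 
for illustration, and bogofilter's own code may well differ in detail.

```python
import math

def robinson_score(probs):
    """Combine per-word spam probabilities (each strictly between 0 and 1)
    using Robinson's geometric-mean scheme.  Hypothetical helper, not
    taken from bogofilter's source."""
    n = len(probs)
    # Geometric means are computed in log space to avoid underflow
    # on long messages.
    p = 1.0 - math.exp(sum(math.log(1.0 - x) for x in probs) / n)
    q = 1.0 - math.exp(sum(math.log(x) for x in probs) / n)
    s = (p - q) / (p + q)
    # Map s from [-1, 1] onto [0, 1].
    return (1.0 + s) / 2.0

# Assumed example data: five strongly spammy words, then the same five
# words buried among 200 indeterminate ones.
strong = [0.99] * 5
neutral = [0.5] * 200

print(robinson_score(strong))            # close to 0.99
print(robinson_score(strong + neutral))  # only slightly above 0.5
```

Because every word takes part in the mean, a handful of strong words 
cannot outweigh a long run of neutral filler.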

I thought the basic math was the same in either case, except that Graham 
limits the selection of words and puts a ceiling on the maximum score.
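
As a rough illustration of the word selection and the ceiling, here is a 
sketch based on my reading of Graham's "A Plan for Spam": keep the 15 
words whose probabilities lie farthest from 0.5, clamp each one to the 
range 0.01 to 0.99, and combine them Bayes-style.  Note that in this 
reading the ceiling applies per word, not to the final score.  The 
function name, parameter names, and sample data are assumptions, and 
bogofilter's implementation may differ.

```python
import math

def graham_score(probs, max_words=15, floor=0.01, ceiling=0.99):
    """Graham-style score from per-word spam probabilities.
    Hypothetical helper, not taken from bogofilter's source."""
    # Clamp each word's probability so no single word is absolute.
    clamped = [min(max(p, floor), ceiling) for p in probs]
    # "Interesting" means far from the neutral value 0.5.
    interesting = sorted(clamped, key=lambda p: abs(p - 0.5),
                         reverse=True)[:max_words]
    spam = math.prod(interesting)
    ham = math.prod(1.0 - p for p in interesting)
    return spam / (spam + ham)

# Assumed example data: five strongly spammy words among 200
# indeterminate ones.
strong = [0.99] * 5
neutral = [0.5] * 200

print(graham_score(strong + neutral))
```

Only the 15 selected words count here, so adding more neutral filler 
leaves the result unchanged.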

-- 
BOFH excuse #196:

Me no internet, only janitor, me just wax floors.