troublesome false negative
Tom Allison
tallison at tacocat.net
Mon Nov 4 01:27:41 CET 2002
David Relson wrote:
>
> Greetings,
>
> The Graham and Robinson algorithms are clearly two different ways to
> calculate spamicity. I wish they were equally good on all messages, but
> there seems to be a class of message where one says "spam" and the other
> says "ham", even though to a human, the message is clearly and obviously
> spam.
>
> One of those "obviously spam" messages arrived and Robinson gave it a
> 0.497731 (ham) rating. I'm wondering what we can do to bogofilter so
> that it'll catch messages like this. The message's subject was "Joke of
> the Day Nov 2" and the actual subject matter was a bit of joke/story
> plus a lot of "lose weight, increase the size of your ..." kind of
> trash. Graham gave the message a 0.99000 (spam) rating.
>
> My goal in writing about this troublesome message is to find some ideas
> for dealing effectively with it.
>
> David
>
How does Graham choose the most interesting words?

It seems to me that Robinson might be a slow learner because it has so
many more words to work with. Given that, the overall score will tend to
drift toward 0.5 for messages from both the spam and ham corpora, because
most of the words will come in as indeterminate.

I thought the basic math was the same in either case, just that Graham
limits the selection of words and puts a ceiling on the maximum score.
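
To make the dilution idea concrete, here is a rough sketch, not
bogofilter's actual code. It assumes Graham keeps the 15 words farthest
from 0.5, clamps each word probability to [0.01, 0.99], and combines them
as a naive Bayes product, while Robinson combines every word using
geometric means. The function names and the made-up message (10 strongly
spammy words buried in 200 neutral "joke" words) are hypothetical, and
Robinson's adjustment for rarely seen words is left out.

```python
from math import prod


def graham_score(probs, n_interesting=15, lo=0.01, hi=0.99):
    """Graham-style score (assumed form): only the most extreme words count."""
    # Clamp each per-word probability into [lo, hi].
    clamped = [min(hi, max(lo, p)) for p in probs]
    # Keep the n_interesting words farthest from the neutral value 0.5.
    picked = sorted(clamped, key=lambda p: abs(p - 0.5), reverse=True)
    picked = picked[:n_interesting]
    spam = prod(picked)
    ham = prod(1.0 - p for p in picked)
    return spam / (spam + ham)


def robinson_score(probs):
    """Robinson-style score (assumed form): every word counts."""
    n = len(probs)
    # P and Q are built from geometric means over all n words.
    p_ind = 1.0 - prod(1.0 - p for p in probs) ** (1.0 / n)
    q_ind = 1.0 - prod(probs) ** (1.0 / n)
    return (1.0 + (p_ind - q_ind) / (p_ind + q_ind)) / 2.0


# Hypothetical message: 10 spammy words plus 200 indeterminate ones.
probs = [0.99] * 10 + [0.5] * 200

print("Graham:  ", graham_score(probs))
print("Robinson:", robinson_score(probs))
```

With these made-up numbers the Graham-style score comes out very close
to 1, while the Robinson-style score lands only slightly above 0.5,
because the 200 neutral words pull the geometric means toward the middle.
So under these assumptions the two differ in more than word selection:
the combining step is different as well.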
--
BOFH excuse #196:
Me no internet, only janitor, me just wax floors.