troublesome false negative
David Relson
relson at osagesoftware.com
Mon Nov 4 01:55:06 CET 2002
At 07:27 PM 11/3/02, Tom Allison wrote:
>How does Graham choose the most interesting words?
For each word, Graham looks at how far the word's spamicity is from
1/2. So a word with probability 0.81 (distance of 0.31) is less
interesting than a word at 0.15 (dist of 0.35). Graham takes the 15 most
interesting words (called "extrema" by ESR) and computes the spamicity from
them.
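A minimal Python sketch of that selection step. The function names are
made up for illustration, and the combining formula is the one from
Graham's "A Plan for Spam" essay rather than anything stated in this post.
It ignores the order problem discussed next and simply ranks every word
by its distance from 1/2.

```python
import heapq
from math import prod

def most_interesting(probs, n=15):
    """Return the n word probabilities farthest from 0.5 (hypothetical helper)."""
    return heapq.nlargest(n, probs, key=lambda p: abs(p - 0.5))

def graham_combine(probs):
    """Combine probabilities as in Graham's essay: prod(p) / (prod(p) + prod(1-p))."""
    spam = prod(probs)
    ham = prod(1.0 - p for p in probs)
    return spam / (spam + ham)

# 0.15 (distance 0.35) outranks 0.81 (distance 0.31).
words = [0.81, 0.15, 0.50, 0.60]
extrema = most_interesting(words, n=2)
score = graham_combine(extrema)
```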
One problem with Graham is that it is order dependent. Imagine a message
having 15 words at 0.1 (definitely ham) and 15 words at 0.9 (definitely
spam). If the 0.1's come first, the extrema array will fill with them. If
the 0.9's come first, they will fill the extrema array. In the first case,
the message is ham; in the second, it's spam.
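The effect can be reproduced with a small Python sketch. It assumes the
extrema array is filled first come, first served, and that a later word
only displaces an entry when it is strictly more interesting. That is my
reading of the behaviour described above, not code taken from bogofilter,
and all names are hypothetical.

```python
from math import prod

def interest(p):
    # Rounded so that 0.1 and 0.9 tie exactly despite floating-point error.
    return round(abs(p - 0.5), 9)

def first_come_extrema(probs, n=15):
    """Fill n slots in message order; replace the weakest slot only on a strict win."""
    extrema = []
    for p in probs:
        if len(extrema) < n:
            extrema.append(p)
            continue
        weakest = min(range(len(extrema)), key=lambda i: interest(extrema[i]))
        if interest(p) > interest(extrema[weakest]):
            extrema[weakest] = p
    return extrema

def graham_combine(probs):
    """Graham's combining formula from "A Plan for Spam"."""
    spam = prod(probs)
    ham = prod(1.0 - p for p in probs)
    return spam / (spam + ham)

ham_first = [0.1] * 15 + [0.9] * 15
spam_first = [0.9] * 15 + [0.1] * 15

ham_first_score = graham_combine(first_come_extrema(ham_first))    # close to 0
spam_first_score = graham_combine(first_come_extrema(spam_first))  # close to 1
```

The two messages contain the same words, yet the scores land at opposite
ends of the scale.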
This particular order dependency can be worked around by a sort-merge
approach. First calculate all the probabilities. Then sort the ham words
into one array and the spam into a second array. Then alternately put a
ham word and a spam word into the extrema array. This will ensure that ham
and spam get equal opportunity to fill the array.
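A sketch of the sort-merge idea in Python, under a few assumptions of
mine: "ham words" means probability below 0.5, "spam words" means above
0.5, words at exactly 0.5 are dropped, and the ham array is drawn from
first. The function name is hypothetical.

```python
from itertools import zip_longest

def sort_merge_extrema(probs, n=15):
    """Pick n extrema by alternating between the strongest ham and spam words."""
    ham = sorted(p for p in probs if p < 0.5)                 # most ham-like first
    spam = sorted((p for p in probs if p > 0.5), reverse=True)  # most spam-like first
    merged = []
    for h, s in zip_longest(ham, spam):
        if h is not None:
            merged.append(h)
        if s is not None:
            merged.append(s)
    return merged[:n]

ham_first = [0.1] * 15 + [0.9] * 15
spam_first = [0.9] * 15 + [0.1] * 15

# Same words, different order, same extrema.
same = sort_merge_extrema(ham_first) == sort_merge_extrema(spam_first)
```

Because everything is sorted before merging, the result no longer depends
on where the words appear in the message. One caveat of this sketch: with
an odd number of slots, whichever array is drawn from first ends up with
one more entry than the other.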
>It seems to me that Robinson might be a slow learner because there are so
>many more words to work with. Given that, there will be a greater
>tendency for the overall score to shift towards 0.5 in both spam and ham
>corpus because a greater majority of the words will be coming in as
>indeterminate words.
Robinson and Graham work with the same group of words. Most of those words
will be uninteresting, i.e. not a strong indicator of hamness nor of
spamness. I don't think there's a difference in learning rates between the
two methods. What is different is the number of words considered when
computing the final result.
Consider a message with 50 very, very spam words (of the worst advertising
or sexual nature you can think of) and 2 or 3 times as many very, very ham
words (perhaps from a totally harmless children's story). As discussed
above, Graham will judge it ham or spam depending on the order of the
words. With the sort-merge modification, the message will get a 50%
probability of being spam. With Robinson, the message is going to be ham
because of the preponderance of ham words. However, with the 50 spam
words, _I_ would call it spam.
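To put a rough number on the Robinson case, here is a Python sketch using
Robinson's geometric-mean combination, S = (1 + (P - Q) / (P + Q)) / 2.
I am assuming that is the form meant here. The probabilities 0.99 and
0.01 and the count of 125 ham words are made-up illustration values, and
the function name is hypothetical.

```python
from math import exp, log

def robinson_combine(probs):
    """Robinson's geometric-mean score over ALL words.

    Assumes every probability is strictly between 0 and 1.
    """
    n = len(probs)
    # Geometric means are computed via logs to avoid underflow on long messages.
    P = 1.0 - exp(sum(log(1.0 - p) for p in probs) / n)
    Q = 1.0 - exp(sum(log(p) for p in probs) / n)
    return (1.0 + (P - Q) / (P + Q)) / 2.0

# 50 strongly spammy words plus 2.5 times as many strongly hammy words.
message = [0.99] * 50 + [0.01] * 125
score = robinson_combine(message)  # falls below 0.5, i.e. on the ham side
```

Every word contributes, so the larger ham population pulls the score
under 0.5 regardless of word order.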
>I thought that the basic math was the same in either case, just that the
>selection of words was limited in Graham and that there was a ceiling to
>the maximum score available in Graham.