troublesome false negative

David Relson relson at osagesoftware.com
Mon Nov 4 01:55:06 CET 2002


At 07:27 PM 11/3/02, Tom Allison wrote:

>How does Graham choose the most interesting words?

For each word, Graham looks at how far the word's spamicity is from 
1/2.  So a word with probability 0.81 (distance of 0.31) is less 
interesting than a word at 0.15 (dist of 0.35).  Graham takes the 15 most 
interesting words (called "extrema" by ESR) and computes the spamicity from 
them.
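
To make the selection concrete, here is a minimal Python sketch. It is not 
bogofilter's code. The function names are made up for illustration, and 
the combining formula is the one from Graham's "A Plan for Spam" as I 
understand it.

```python
from math import prod

def interestingness(p):
    """Distance of a word's spamicity from the neutral value 1/2."""
    return abs(p - 0.5)

def graham_extrema(probs, n=15):
    """Return the n most interesting probabilities, most interesting first."""
    return sorted(probs, key=interestingness, reverse=True)[:n]

def graham_spamicity(probs, n=15):
    """Combine the extrema: prod(p) / (prod(p) + prod(1 - p))."""
    extrema = graham_extrema(probs, n)
    spam = prod(extrema)
    ham = prod(1.0 - p for p in extrema)
    return spam / (spam + ham)

# 0.15 is 0.35 away from 1/2, so it outranks 0.81, which is 0.31 away.
print(graham_extrema([0.81, 0.5, 0.15], n=2))
```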

One problem with Graham is that it is order dependent.  Imagine a message 
having 15 words at 0.1 (definitely ham) and 15 words at 0.9 (definitely 
spam).  If the 0.1's come first, the extrema array will fill with them.  If 
the 0.9's come first, they will fill the extrema array.  In the first case, 
the message is ham; in the second, it's spam.
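
The effect can be reproduced with a small sketch.  It assumes the extrema 
array is filled as words are read, and that a later word replaces an entry 
only when it is strictly more interesting than the weakest entry.  That 
fill rule is my simplification for illustration, not a copy of the actual 
code, and the names are hypothetical.

```python
from math import prod

def distance(p):
    # Rounded so that 0.1 and 0.9 compare as exactly equidistant from 0.5.
    return round(abs(p - 0.5), 9)

def fill_extrema(probs, n=15):
    """Fill a fixed-size extrema array in message order (assumed rule)."""
    extrema = []
    for p in probs:
        if len(extrema) < n:
            extrema.append(p)
            continue
        weakest = min(range(n), key=lambda i: distance(extrema[i]))
        # Ties do not replace, so whichever words arrive first stay put.
        if distance(p) > distance(extrema[weakest]):
            extrema[weakest] = p
    return extrema

def combine(extrema):
    """Graham-style combination of the selected probabilities."""
    spam = prod(extrema)
    ham = prod(1.0 - p for p in extrema)
    return spam / (spam + ham)

ham_first = [0.1] * 15 + [0.9] * 15
spam_first = [0.9] * 15 + [0.1] * 15

# Same words, different order, opposite verdicts.
print(combine(fill_extrema(ham_first)))
print(combine(fill_extrema(spam_first)))
```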

This particular order dependency can be worked around by a sort-merge 
approach.  First calculate all the probabilities.  Then sort the ham words 
into one array and the spam into a second array.  Then alternately put a 
ham word and a spam word into the extrema array.  This will ensure that ham 
and spam get equal opportunity to fill the array.
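
A rough sketch of that sort-merge step follows.  The function name is 
hypothetical, and dropping words at exactly 0.5 is my assumption, since 
they are neither ham nor spam.

```python
from itertools import zip_longest

def sort_merge_extrema(probs, n=15):
    """Pick extrema by alternating between the ham and spam words."""
    ham = sorted(p for p in probs if p < 0.5)                 # most hammy first
    spam = sorted((p for p in probs if p > 0.5), reverse=True)  # most spammy first
    merged = []
    for h, s in zip_longest(ham, spam):
        # When one list runs out, the other keeps filling the array.
        if h is not None:
            merged.append(h)
        if s is not None:
            merged.append(s)
    # With an odd n, the class placed first (ham here) gets one extra slot.
    return merged[:n]

ham_first = [0.1] * 15 + [0.9] * 15
spam_first = [0.9] * 15 + [0.1] * 15

# The selection no longer depends on the order of words in the message.
print(sort_merge_extrema(ham_first, n=14) == sort_merge_extrema(spam_first, n=14))
```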

>It seems to me that Robinson might be a slow learner because there are so 
>many more words to work with.  Given that, there will be a greater 
>tendency for the overall score to shift towards 0.5 in both spam and ham 
>corpus because a greater majority of the words will be coming in as 
>indeterminate words.

Robinson and Graham work with the same group of words.  Most of those words 
will be uninteresting, i.e. not a strong indicator of hamness nor of 
spamness.  I don't think there's a difference in learning rates between the 
two methods.  What is different is the number of words considered when 
computing the final result.

Consider a message with 50 very, very spam words (of the worst advertising 
or sexual nature you can think of) and 2 or 3 times as many very, very ham 
words (perhaps from a totally harmless children's story).  As discussed 
above, Graham will judge it ham or spam depending on the order of the 
words.  With the sort-merge modification, the message will get a 50% 
probability of being spam.  With Robinson, the message is going to be ham 
because of the preponderance of ham words.  However, with the 50 spam 
words, _I_ would call it spam.
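
To put a number on the Robinson case, here is a sketch using Robinson's 
geometric-mean combination as I understand it.  The 0.9 and 0.1 values and 
the count of 125 ham words (2.5 times the 50 spam words) are made-up 
stand-ins for "very, very spam" and "very, very ham".

```python
from math import exp, log

def robinson_score(probs):
    """Robinson-style score over ALL words; needs 0 < p < 1 for each p."""
    n = len(probs)
    # P grows with spammy words, Q grows with hammy words.
    # Logs are summed instead of multiplying to avoid underflow.
    P = 1.0 - exp(sum(log(1.0 - p) for p in probs) / n)
    Q = 1.0 - exp(sum(log(p) for p in probs) / n)
    S = (P - Q) / (P + Q)
    return (1.0 + S) / 2.0

message = [0.9] * 50 + [0.1] * 125
score = robinson_score(message)

# The larger number of ham words pulls the score below 0.5.
print(score)
```

Because every word contributes, the order of the words makes no 
difference here; only the mix does.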


>I thought that the basic math was the same in either case, just that the 
>selection of words was limited in Graham and that there was a ceiling to 
>the maximum score available in Graham.




