troublesome false negative

Mon Nov 4 04:31:46 CET 2002

At 09:53 PM 11/3/02, Greg Louis wrote:

>On 20021103 (Sun) at 2138:47 -0500, David Relson wrote:
> > At 09:27 PM 11/3/02, you wrote:
> >
> > I have a hypothesis:  many more ham words than spam words will produce a
> > ham result.  I need to figure out how to test it to find out if my
> > intuition is good or bad.
>
>You can derive that from the calculation method without ever looking at
>a real email; that's what I was trying to say when I wrote the "my
>first thought" paragraph.  Pick any number of spam words and decide how
>spammy you want 'em, say 20 words of f(w) = 0.95 (which is pretty
>spammy on the Robinson scale); now pick a nonspammy f(w) value, say
>0.002; it's not a big deal to sit down with a pencil and decide how
>many 0.002's it will take to bring the logarithmic means down to where
>S will come out below whatever your chosen SPAM_CUTOFF value is.  Or if
>you're lazy you can brute-force it with a computer; just write a loop
>to keep adding 1 to the number of 0.002 words till S drops below
>SPAM_CUTOFF.
>
>So your hypothesis is unquestionably valid.  Now what? :)
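
The brute-force loop described above can be sketched roughly as follows. This is only an illustration in Python, and it assumes the geometric-mean form of the Robinson calculation: P = 1 - (prod(1 - f(w)))^(1/n), Q = 1 - (prod(f(w)))^(1/n), S = (1 + (P - Q)/(P + Q))/2. The function names, the `limit` guard and any cutoff passed in are hypothetical and are not taken from the bogofilter source.

```python
import math

def robinson_s(fw):
    """Combine per-word f(w) values into S (assumed geometric-mean form).

    Every value must lie strictly between 0 and 1, and fw must be non-empty.
    """
    n = len(fw)
    # Geometric means are taken via logarithms to avoid underflow.
    p = 1.0 - math.exp(sum(math.log(1.0 - f) for f in fw) / n)
    q = 1.0 - math.exp(sum(math.log(f) for f in fw) / n)
    return (1.0 + (p - q) / (p + q)) / 2.0

def ham_words_needed(n_spam, f_spam, f_ham, cutoff, limit=100000):
    """Brute force: add f_ham words one at a time until S < cutoff.

    Returns the number of f_ham words needed, or None if `limit`
    (a hypothetical safety bound) is reached first.
    """
    for k in range(limit + 1):
        if robinson_s([f_spam] * n_spam + [f_ham] * k) < cutoff:
            return k
    return None
```

Calling `ham_words_needed(20, 0.95, 0.002, cutoff)` with whatever SPAM_CUTOFF is in use gives the count for the example numbers in the quoted text.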

So, if a message contains lots of words not seen previously, or seen
rarely, there'll be lots of words with low spamicity.  These will pull
the classification down and give the message a low overall spamicity.
One of the differences between Graham and Robinson is that Graham
compares goodcount+spamcount to MINIMUM_FREQ, while Robinson doesn't have
this check.  Under Graham, words with totalcount < MINIMUM_FREQ are given
UNKNOWN_WORD (currently 0.4) as their spamicity.  I'll have to test and
see what adding this check does to my troublesome message.
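
The MINIMUM_FREQ check can be sketched as a small filter applied before a word's spamicity is used. This is a hedged illustration in Python, not the bogofilter code: the function name is hypothetical, the default of 5 for MINIMUM_FREQ is an assumption (the value is not stated here), and `estimate` stands for whatever spamicity the method would otherwise have computed for the word.

```python
UNKNOWN_WORD = 0.4   # spamicity given to rarely seen words (from the text)
MINIMUM_FREQ = 5     # assumed threshold; the real value is not given here

def checked_spamicity(goodcount, spamcount, estimate,
                      minimum_freq=MINIMUM_FREQ, unknown_word=UNKNOWN_WORD):
    """Return `estimate` unless the word was seen too rarely to trust it."""
    totalcount = goodcount + spamcount
    if totalcount < minimum_freq:
        # Too few sightings: fall back to the neutral-ish default.
        return unknown_word
    return estimate
```

With this check in place, a word seen only once in ham would contribute 0.4 rather than something like 0.002, so a handful of rare words could no longer drag S down as quickly.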

BCNU.

David