troublesome false negative
David Relson
relson at osagesoftware.com
Mon Nov 4 04:31:46 CET 2002
At 09:53 PM 11/3/02, Greg Louis wrote:
>On 20021103 (Sun) at 2138:47 -0500, David Relson wrote:
> > At 09:27 PM 11/3/02, you wrote:
> >
> > I have a hypothesis: many more ham words than spam words will produce a
> > ham result. I need to figure out how to test it to find out if my
> > intuition is good or bad.
>
>You can derive that from the calculation method without ever looking at
>a real email; that's what I was trying to say when I wrote the "my
>first thought" paragraph. Pick any number of spam words and decide how
>spammy you want 'em, say 20 words of f(w) = 0.95 (which is pretty
>spammy on the Robinson scale); now pick a nonspammy f(w) value, say
>0.002; it's not a big deal to sit down with a pencil and decide how
>many 0.002's it will take to bring the logarithmic means down to where
>S will come out below whatever your chosen SPAM_CUTOFF value is. Or if
>you're lazy you can brute-force it with a computer; just write a loop
>to keep adding 1 to the number of 0.002 words till S drops below
>SPAM_CUTOFF.
>
>So your hypothesis is unquestionably valid. Now what? :)
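
Greg's brute-force loop might be sketched like this. It assumes the Robinson combining rule is P = 1 - geometric mean of (1 - f(w)), Q = 1 - geometric mean of f(w), S = (1 + (P - Q)/(P + Q))/2. The cutoff of 0.54 and the names robinson_s and ham_words_needed are illustrative, not taken from the bogofilter source.

```python
import math


def robinson_s(spamicities):
    """Combine per-word f(w) values into a message score S (assumed Robinson rule)."""
    n = len(spamicities)
    # Work in log space so long products do not underflow.
    log_not_spam = sum(math.log(1.0 - f) for f in spamicities)
    log_spam = sum(math.log(f) for f in spamicities)
    p = 1.0 - math.exp(log_not_spam / n)  # 1 - geometric mean of (1 - f)
    q = 1.0 - math.exp(log_spam / n)      # 1 - geometric mean of f
    return (1.0 + (p - q) / (p + q)) / 2.0


def ham_words_needed(n_spam=20, f_spam=0.95, f_ham=0.002,
                     cutoff=0.54, limit=10000):
    """Keep adding one hammy word until S drops below the cutoff.

    cutoff stands in for SPAM_CUTOFF; 0.54 is an assumed value.
    Returns the number of hammy words added, or None if limit is reached.
    """
    for k in range(limit + 1):
        words = [f_spam] * n_spam + [f_ham] * k
        if robinson_s(words) < cutoff:
            return k
    return None


if __name__ == "__main__":
    print(ham_words_needed())
```

Changing f_ham or cutoff shows how sensitive the count is to the values chosen.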
So, if a message contains lots of words that were seen rarely or never
before, it will have lots of words with low spamicity. Those words pull
the score down and the message gets classified as ham. One of the
differences between Graham and Robinson is that Graham compares
goodcount+spamcount to MINIMUM_FREQ, while Robinson has no such check.
Under Graham, words with totalcount (goodcount+spamcount) < MINIMUM_FREQ
are given UNKNOWN_WORD (currently 0.4) as their spamicity. I'll have to
test and see what this does to my troublesome message.
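
A rough sketch of the Graham-style check, for illustration only. MINIMUM_FREQ = 5 is an assumption (the value is not stated here), and the known-word branch follows the ratio-and-clamp idea from Graham's "A Plan for Spam" in simplified form, not necessarily what bogofilter computes. The function name and the ngood/nspam parameters (message counts for each corpus, both assumed positive) are hypothetical.

```python
MINIMUM_FREQ = 5    # assumed threshold; not given in the message
UNKNOWN_WORD = 0.4  # value quoted in the message


def graham_spamicity(goodcount, spamcount, ngood, nspam):
    """Per-word spamicity with a minimum-frequency check (simplified sketch)."""
    # Rare or unseen words get a fixed, mildly hammy value.
    if goodcount + spamcount < MINIMUM_FREQ:
        return UNKNOWN_WORD
    good_rate = min(1.0, goodcount / ngood)
    spam_rate = min(1.0, spamcount / nspam)
    p = spam_rate / (good_rate + spam_rate)
    # Clamp so no single word is treated as absolute proof.
    return max(0.01, min(0.99, p))
```

With a check like this, a word seen only twice scores 0.4 no matter which corpus it came from, so a pile of rare words cannot drag the message score far in either direction.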
BCNU.
David