troublesome false negative
Greg Louis
glouis at dynamicro.on.ca
Mon Nov 4 12:29:16 CET 2002
On 20021104 (Mon) at 0621:31 -0500, Greg Louis wrote:
> > One of the
> > differences between Graham and Robinson is that Graham compares
> > goodcount+spamcount to MINIMUM_FREQ, while Robinson doesn't have this
> > check.
>
> Robinson's paper explains why, if you use the f(w) calculation, this
> check is not needed. This is what the s and x parameters are about.
> Remember, x (which I currently have set to 0.415 and you to 0.200) is
> the probability that an unknown word will receive, and s determines the
> weight of x with respect to the actual count when the actual count is
> low.
>
Sorry, I meant to add that in my bogofilter setup, your "msg.1103.txt"
(the message with lots of spam words but lots of nonspam-looking words
as well) produces an S value of 0.649192, well over my 0.542 spam
cutoff. I also meant to add that setting the probability for an
unknown word to 0.2 might be part of your trouble: it gives words
never seen before, or seen only a few times, a lot of weight on the
nonspam side.
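
To make the effect of x concrete, here is a rough sketch of the f(w)
calculation from Robinson's paper, f(w) = (s*x + n*p(w)) / (s + n),
where n is the number of times the word has been seen and p(w) is its
raw spam probability. The function name, the value s = 1.0 and the
example p(w) = 0.9 are assumptions chosen for illustration only; they
are not taken from either setup discussed here, and this is not
bogofilter's actual code.

```python
def f_w(p_w, n, s, x):
    """Robinson's smoothed word probability.

    p_w: raw spam probability of the word from the counts
    n:   number of times the word has been seen (good + spam)
    s:   strength given to the prior x relative to the observed count
    x:   probability assumed for a word never seen before
    """
    return (s * x + n * p_w) / (s + n)


if __name__ == "__main__":
    s = 1.0      # assumed value, for illustration only
    p_w = 0.9    # hypothetical spammy word
    for x in (0.415, 0.2):
        for n in (0, 1, 2, 10):
            print(f"x={x:<5} n={n:<2} f(w)={f_w(p_w, n, s, x):.4f}")
```

With n = 0 the result is simply x, so an unknown word scores 0.2 or
0.415 depending on the setting. As n grows, f(w) moves toward p(w) and
the choice of x matters less, which is why a separate minimum-count
check is not needed.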
--
| G r e g L o u i s | gpg or pgp: finger |
| Consultronics Corporate Manager | glouis at consultronics.com |
| Information Systems & Technology | for public keys |
| http://www.consultronics.com | http://www.bgl.nu/~glouis |