troublesome false negative

Greg Louis glouis at dynamicro.on.ca
Mon Nov 4 12:29:16 CET 2002


On 20021104 (Mon) at 0621:31 -0500, Greg Louis wrote:
> > One of the differences between Graham and Robinson is that Graham
> > compares goodcount+spamcount to MINIMUM_FREQ, while Robinson doesn't
> > have this check.
> 
> Robinson's paper explains why, if you use the f(w) calculation, this
> check is not needed.  This is what the s and x parameters are about. 
> Remember, x (which I currently have set to 0.415 and you to 0.200) is
> the probability that an unknown word will receive, and s determines the
> weight of x with respect to the actual count when the actual count is
> low.
> 
Sorry, I meant to add that in my bogofilter setup your "msg.1103.txt"
(the message with lots of spam words but lots of nonspam-looking words
as well) produces an S value of 0.649192, well over my 0.542 spam
cutoff.  I also meant to add that setting the probability for an
unknown word to 0.2 might be part of your trouble; that gives words
never seen before, or seen only a few times, a lot of weight on the
nonspam side.
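
To make the effect of x concrete, here is a minimal Python sketch of the
f(w) calculation as given in Robinson's paper,
f(w) = (s*x + n*p(w)) / (s + n), where n is the number of times the word
has been seen and p(w) is the raw spam probability from the counts. The
function name, the value s = 1.0 and the sample p(w) of 0.9 are
assumptions chosen for illustration, not settings taken from either
setup; only the two x values (0.415 and 0.200) come from this thread.

```python
def robinson_f(p_w: float, n: int, s: float, x: float) -> float:
    """Blend the prior x with the observed estimate p_w.

    With n == 0 the result is exactly x; as n grows, p_w dominates.
    """
    return (s * x + n * p_w) / (s + n)


S_ASSUMED = 1.0    # hypothetical weight given to the prior x
P_W_ASSUMED = 0.9  # hypothetical raw probability for a spammy word

if __name__ == "__main__":
    for x in (0.415, 0.200):
        for n in (0, 1, 5, 50):
            f_w = robinson_f(P_W_ASSUMED, n, S_ASSUMED, x)
            print(f"x={x:.3f} n={n:3d} f(w)={f_w:.4f}")
```

Under these assumed values, a word seen only once with p(w) = 0.9 scores
0.55 with x = 0.200 but about 0.66 with x = 0.415, so the lower x pulls
rarely seen words noticeably toward the nonspam side. The gap shrinks as
n grows, which is why no separate MINIMUM_FREQ check is needed.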

-- 
| G r e g  L o u i s               | gpg or pgp: finger        |
| Consultronics Corporate Manager  |  glouis at consultronics.com |
| Information Systems & Technology |  for public keys          |
| http://www.consultronics.com     | http://www.bgl.nu/~glouis |



