x excluding unknowns, was: William's suggestion

Greg Louis glouis at dynamicro.on.ca
Sun Nov 24 16:25:18 CET 2002


On 20021123 (Sat) at 1727:01 -0500, David Relson wrote:

> One interesting detail is the relation between ROBX and MIN_DEV.  Using 
> 0.415 and 0.1, unknown words are excluded from the computation.  This is 
> certainly different from ESR's original implementation which took unknowns 
> as ham indicators and used the good_bias of 2.
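
To make that relation concrete, here is a minimal sketch, assuming that a word never seen in training is scored as x (ROBX) and that any word whose score lies within min_dev of 0.5 is left out of the computation.  The function name is_excluded and the strict "<" comparison are illustrative assumptions, not bogofilter's actual code.

```python
def is_excluded(score, min_dev=0.1):
    """Return True if a word score is too close to 0.5 to be counted.

    Assumption: exclusion means |score - 0.5| < min_dev (strict).
    """
    return abs(score - 0.5) < min_dev


# An unknown word is assumed to score exactly x (ROBX).
for robx in (0.415, 0.395):
    state = "excluded" if is_excluded(robx) else "included"
    print(f"x={robx}: unknown words are {state}")
```

With min_dev=0.1, x=0.415 is 0.085 away from 0.5 and falls inside the excluded band, while x=0.395 is 0.105 away and falls outside it.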

It turns out that excluding unknown words may be a good thing.  These
data are from an experiment in which the training set (4176 nonspams
and 685 spams) was small enough to make it likely that the test set
would contain unknowns.  The percentages of false negatives are shown
for three independent runs of 457 spams each, for min_dev=0.1, with
x=0.415 (which excludes unknowns) and x=0.395 (which doesn't):

run  x=0.415  x=0.395
 1     15.1     17.7
 2     16.8     19.3
 3     20.6     22.3

Running a paired t test on the data gives

P=0.015
mean diff = 2.267
95% c.l.  = 1.04, 3.49

so using x=0.395 produced significantly worse results with these
messages.
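
For anyone who wants to check the arithmetic, here is a minimal sketch of the paired t test on the three runs above, using only the Python standard library.  It relies on the fact that Student's t distribution with 2 degrees of freedom has a closed-form CDF, so it only works for exactly three pairs; the function name paired_t_df2 is illustrative, and this is not necessarily how the figures in the post were computed.

```python
import math
import statistics

# False-negative percentages from the three runs in the table above.
fn_x0415 = [15.1, 16.8, 20.6]  # x=0.415 (unknowns excluded)
fn_x0395 = [17.7, 19.3, 22.3]  # x=0.395 (unknowns included)


def paired_t_df2(a, b, alpha=0.05):
    """Two-sided paired t test for exactly three pairs (df = 2).

    Returns (mean difference, t statistic, p value, confidence limits).
    """
    if len(a) != 3 or len(b) != 3:
        raise ValueError("closed form below is only valid for 3 pairs")
    diffs = [y - x for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    # Standard error of the mean difference (sample standard deviation).
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))
    t = mean_diff / se
    # For df = 2 the CDF is F(t) = 1/2 + t / (2 * sqrt(t^2 + 2)),
    # so the two-sided p value is 1 - |t| / sqrt(t^2 + 2).
    p = 1.0 - abs(t) / math.sqrt(t * t + 2.0)
    # Invert the same formula to get the two-sided critical value.
    q = 1.0 - alpha
    t_crit = q * math.sqrt(2.0 / (1.0 - q * q))
    return mean_diff, t, p, (mean_diff - t_crit * se, mean_diff + t_crit * se)


mean_diff, t, p, (lo, hi) = paired_t_df2(fn_x0415, fn_x0395)
print(f"P={p:.3f}")
print(f"mean diff = {mean_diff:.3f}")
print(f"95% c.l.  = {lo:.2f}, {hi:.2f}")
```

The differences are taken as (x=0.395) minus (x=0.415), so a positive mean difference means more false negatives when unknowns are included.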

-- 
| G r e g  L o u i s          | gpg public key:      |
|   http://www.bgl.nu/~glouis |   finger greg at bgl.nu |



