min_dev

Fri Jun 25 21:56:07 CEST 2004

My ham zone is very small (0 - 0.15) and my spam zone is quite a bit larger
(0.46 - 1.0).  I still get lots of spams in my unsures and no hams in my
unsures.  Therefore, I could stand to expand my spam zone even further.
However, my unsure zone is already off-center from the 0.5 point used for
min_dev calculations, and shifting it even further down would cause these
parameters to conflict.  Bogofilter still makes the incorrect assumption
that 0.5 is neutral for min_dev calculations.  In reality, 0.5 is well
inside of my spam zone.  If I reduced my spam cutoff, I'd have to make my
min_dev very small so that it doesn't adversely affect the scoring, and I'd
lose the entire benefit of having a min_dev.

We have a parameter to specify the "neutral" point, and that is robx.  If we
have a token we haven't seen before, it is inherently defined as being
unsure, and robx is the value it gets.  Therefore, robx should be the center
of the min_dev calculation.  Problem is, we sometimes want to bias one way
or the other with robx, so that might not be the best value to use.  There's
another way to define the center point for min_dev, and that is exactly
half-way between the spam cutoff and the ham cutoff.  This is the center of
our unsure zone, and it would be ideal for the min_dev calculation.  If a
token scores near the center of the unsure zone, it would naturally be one
that should be excluded from the scoring.  This would make much more sense
than excluding something near 0.5 which is very spammy for me.  And if we
were to use that value, we could go one step further and completely
eliminate the min_dev parameter, instead setting it to half the width of the
unsure zone.  In other words, min_dev = (spam_cutoff - ham_cutoff) / 2,
and min_dev_center = ham_cutoff + min_dev.
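
To make the arithmetic concrete, here is a minimal Python sketch of that
derivation.  It is not bogofilter code: the function names are made up for
illustration, and treating a token exactly on the edge of the zone as
"scored" is an assumption.

```python
def derive_min_dev(ham_cutoff, spam_cutoff):
    """Derive (min_dev_center, min_dev) from the two email cutoffs.

    Hypothetical helper: min_dev is half the width of the unsure zone
    and the center is the midpoint of that zone.
    """
    min_dev = (spam_cutoff - ham_cutoff) / 2
    min_dev_center = ham_cutoff + min_dev
    return min_dev_center, min_dev


def token_is_scored(prob, center, min_dev):
    """A token counts only if it is at least min_dev away from the center.

    Including the boundary (>=) is an assumption of this sketch.
    """
    return abs(prob - center) >= min_dev


# Unsure zone of 0.15 - 0.3, as in the example in this post.
center, min_dev = derive_min_dev(0.15, 0.30)

# A token at 0.5 is far outside the zone, so it is scored;
# a token at 0.2 sits inside the unsure zone, so it is ignored.
scored_spammy = token_is_scored(0.5, center, min_dev)
scored_unsure = token_is_scored(0.2, center, min_dev)
```

With these cutoffs the center comes out at about 0.225 and min_dev at about
0.075, so the excluded band is exactly the unsure zone.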

If this were true, I could now move my spam_cutoff well below 0.5, and I
would still gain the benefit of using min_dev.  My unsure zone might be
0.15 - 0.3, and this would also be the min_dev zone.  This way, if
I receive an email consisting of only one token, it will intuitively be
classified the same as that token.  Currently it doesn't necessarily work
that way.

This only works for three-way classifications though.  An alternative
would be to drop the min_dev center altogether and simply specify a token
ham_cutoff and spam_cutoff in addition to the email ham_cutoff and
spam_cutoff.  This way, you can decide on the three-way classification of
tokens without trying to figure out where it should be centered.  For
instance, I could decide that my email ham_cutoff is 0 (only two-way
classification) and my email spam_cutoff is 0.3, but specify my token
ham_cutoff at 0.2 and my token spam_cutoff at 0.4.  This is the same as a
0.1 min_dev centered at 0.3.
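
The equivalence between the two forms can be sketched in a few lines of
Python.  Again this is only an illustration: token_ham_cutoff and
token_spam_cutoff are the proposed parameters, not existing bogofilter
options, and counting a token that lands exactly on a cutoff is an
assumption.

```python
def token_is_scored(prob, token_ham_cutoff, token_spam_cutoff):
    """Score a token only if it falls outside the token unsure zone.

    Hypothetical rule: tokens at or below the token ham cutoff, or at or
    above the token spam cutoff, are used; everything between is ignored.
    """
    return prob <= token_ham_cutoff or prob >= token_spam_cutoff


def as_center_and_dev(token_ham_cutoff, token_spam_cutoff):
    """Express a pair of token cutoffs as an equivalent (center, min_dev)."""
    dev = (token_spam_cutoff - token_ham_cutoff) / 2
    return token_ham_cutoff + dev, dev


# Token cutoffs of 0.2 and 0.4, independent of the email cutoffs.
center, dev = as_center_and_dev(0.2, 0.4)

hammy_used = token_is_scored(0.10, 0.2, 0.4)    # below the token ham cutoff
neutral_used = token_is_scored(0.30, 0.2, 0.4)  # inside the token unsure zone
spammy_used = token_is_scored(0.45, 0.2, 0.4)   # above the token spam cutoff
```

Here the center works out to about 0.3 and the deviation to about 0.1,
while the email cutoffs stay free to be set to anything, including a
two-way classification.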

Can anyone provide any reason why the min_dev should be centered at 0.5?  Is
there reason to believe that the calculations wouldn't be improved by
removing that requirement?

Tom