Exclusion Intervals

Wed Jun 30 14:38:22 CEST 2004

Hi Tom,

There's been a lot of discussion recently about min_dev and the fact
that the exclusion interval goes from 0.5-min_dev to 0.5+min_dev.
Thinking about it this morning, I realized that there's an element of
comparing apples to oranges in this discussion.  

Point 1:  Token scores are individual probabilities centered around 0.5,
aka "even odds". 

Point 2: Message scores are the result of a chi-square test and
bogofilter normalizes the result to the 0..1 interval.  

Here's a bit more detail:

Step 1 of scoring a message is to score each token.  This gives
probability scores, which are centered around 0.5.  These values are
(roughly) linear, with 0.0 meaning "completely hammish", 0.5 meaning "no
clue", and 1.0 meaning "completely spammish".  min_dev applies to these
values.  
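
To make step 1 concrete, here is a minimal Python sketch of the exclusion
interval. It is not bogofilter's actual code: the function name, the sample
token scores, and the choice to keep a token that sits exactly on the
boundary are all assumptions made for illustration.

```python
def outside_exclusion_interval(token_scores, min_dev):
    """Keep only tokens whose score is at least min_dev away from 0.5.

    Tokens inside (0.5 - min_dev, 0.5 + min_dev) carry little evidence
    either way and are dropped before the message is scored.
    Boundary handling (>= rather than >) is an assumption.
    """
    return {tok: p for tok, p in token_scores.items()
            if abs(p - 0.5) >= min_dev}


# Hypothetical token scores, invented for this example.
scores = {"viagra": 0.99, "meeting": 0.05, "the": 0.52, "hello": 0.45}
kept = outside_exclusion_interval(scores, min_dev=0.1)
print(sorted(kept))
```

With min_dev=0.1 the interval is 0.4 to 0.6, so "the" and "hello" are
ignored and only the two strongly scored tokens remain.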

Step 2 is to apply the Bayesian computation to these probabilities. This
produces another probability.  This value is also linear (in the same
sense as the step 1 value).

Step 3 applies the inverse chi-square test.  This looks at the step 2
score and the number of tokens comprising it and computes a value
indicating the "certainty" with which the score represents ham or spam.
If I remember what little I know of statistics, this "certainty" is on a
bell curve.  The actual computed value ranges between -1 and +1 and
bogofilter normalizes it to a value between 0 and 1.
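
The normalization from -1..+1 to 0..1 can be sketched as follows. This
follows the commonly described Fisher-style combination (an inverse
chi-square on the token probabilities, taken once for spam and once for
ham); it is an assumption that this matches bogofilter's code in detail,
and the names chi2q and message_score are mine.

```python
import math


def chi2q(x2, dof):
    """Chi-square survival function for an even number of degrees of freedom."""
    if dof % 2 != 0:
        raise ValueError("dof must be even")
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    # Rounding can push the sum a hair above 1.0.
    return min(total, 1.0)


def message_score(probs):
    """Combine token probabilities (each strictly between 0 and 1)."""
    n = len(probs)
    # Close to 1 when the token probabilities are mostly high (spammy).
    spamminess = chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    # Close to 1 when the token probabilities are mostly low (hammy).
    hamminess = chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    # The difference lies in -1..+1; shift and halve to get 0..1.
    return (1.0 + spamminess - hamminess) / 2.0


print(message_score([0.99] * 5))  # close to 1
print(message_score([0.01] * 5))  # close to 0
print(message_score([0.5, 0.5]))  # no evidence either way
```

Note how quickly the result saturates: five tokens at 0.99 already push
the score almost all the way to 1, which is why this value does not behave
like the roughly linear token scores of step 1.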

Step 4 applies the ham_cutoff and spam_cutoff values to classify the
message as ham, spam, or unsure.
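
Step 4 is simple enough to write out. A minimal sketch, assuming that a
score exactly on a cutoff counts as spam (or ham); the function name and
the cutoff values in the example are illustrative only.

```python
def classify(score, ham_cutoff, spam_cutoff):
    """Map a normalized message score (0..1) to ham, spam, or unsure."""
    if score >= spam_cutoff:
        return "spam"
    if score <= ham_cutoff:
        return "ham"
    # Everything strictly between the two cutoffs is left undecided.
    return "unsure"


# Illustrative cutoffs; note that they need not be symmetric around 0.5.
for s in (0.02, 0.30, 0.75):
    print(s, classify(s, ham_cutoff=0.125, spam_cutoff=0.50001))
```

The range between the two cutoffs is the "unsure" band, and nothing forces
it to be centered on 0.5.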

Both steps 1 and 4 can be considered as having "range centers" and
"range widths".  This similarity does not mean that these steps have
comparable centers or comparable widths.  I've run bogotune with a
variety of test corpora and looked at its recommendations.  It generally
recommends a spam cutoff slightly above 0.5 (often a value like
0.500010) and a ham cutoff much below 0.5 (at least 0.125).  This lack
of symmetry in the final score is a further indication of apples and
oranges: scoring tokens with a symmetric, centered exclusion interval
produces inverse chi-square results whose "unsure" range has a different
width and a different center.

As I've indicated, I'm willing to add (on an experimental basis) a
parameter for specifying the center of the exclusion interval.  At the
moment I've got no clue how much of a difference doing that will
make.  So far, however, NOBODY has responded to those suggestions.
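
For anyone who wants to picture what such a parameter would do, here is a
minimal sketch. The parameter name "center" and the function are
hypothetical; nothing like this exists in bogofilter as of this message.

```python
def excluded(p, min_dev, center=0.5):
    """True if token score p falls inside the exclusion interval.

    'center' is a hypothetical parameter; 0.5 reproduces the current
    behavior, where the interval is 0.5 - min_dev to 0.5 + min_dev.
    """
    return abs(p - center) < min_dev


# The same two token scores, judged with the interval in two places.
for p in (0.35, 0.45):
    print(p, excluded(p, min_dev=0.1), excluded(p, min_dev=0.1, center=0.3))
```

Moving the center from 0.5 to 0.3 swaps which of the two tokens is
ignored, so the parameter changes which tokens reach the scoring steps,
not how they are scored.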

Regards,

David