Exclusion Intervals

Wed Jun 30 15:10:50 CEST 2004

David Relson <relson at osagesoftware.com> wrote:

>Step 1 of scoring a message is to score each token.  This gives
>probability scores, which are centered around 0.5.  These values are
>(roughly) linear, with 0.0 meaning "completely hammish", 0.5 meaning "no
>clue", and 1.0 meaning "completely spammish".  min_dev applies to these
>values.  
>
>Step 2 is to apply the Bayesian computation to these probabilities. This
>produces another probability.  This value is also linear (in the same
>sense as the step 1 value).
>
>Step 3 applies the inverse chi-square test.  This looks at the step 2
>score and the number of tokens comprising it and computes a value
>indicating the "certainty" with which the score represents ham or spam.
>If I remember what little I know of statistics, this "certainty" is on a
>bell curve.  The actual computed value ranges between -1 and +1 and
>bogofilter normalizes it to a value between 0 and 1.

More precisely, two values are combined here. They come from
testing whether we can refute the statement that the message
is ham or that it is spam, respectively. Each of them is
highly asymmetric. Also, those values are *not* probabilities,
and neither is their (normalized) combination, even if it
looks like one. Furthermore, the values depend on additional
parameters (like robx and robs), so no particular statement
is associated with the value 0.5.
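
To illustrate the point, here is a rough sketch of the
Robinson/Fisher style of combining, written from the published
description of the method and not taken from the bogofilter
source. The function names, the simplified per-token estimate
(raw counts, no correction for corpus sizes) and the robx/robs
values used below are all assumptions made for the example.

```python
import math

def chi2q(x2, dof):
    """Upper-tail probability of a chi-square value, for even dof only."""
    assert dof % 2 == 0
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def token_score(spam_count, ham_count, robx, robs):
    """Smoothed token score: robx is the value assumed for an unseen
    token, robs is the weight given to that assumption.
    Simplification: p is taken from raw counts."""
    n = spam_count + ham_count
    p = spam_count / n if n else 0.0
    return (robs * robx + n * p) / (robs + n)

def combine(scores):
    """Return the two test values and their normalized combination."""
    n = len(scores)
    # Becomes small when many token scores are close to 0.
    p_from_f = chi2q(-2.0 * sum(math.log(f) for f in scores), 2 * n)
    # Becomes small when many token scores are close to 1.
    p_from_1mf = chi2q(-2.0 * sum(math.log(1.0 - f) for f in scores), 2 * n)
    # Map the difference (range -1..+1) onto 0..1.
    return p_from_f, p_from_1mf, (1.0 + p_from_f - p_from_1mf) / 2.0

# Five tokens never seen before, with an illustrative robx of 0.4.
unknown = [token_score(0, 0, robx=0.4, robs=1.0)] * 5
a, b, combined = combine(unknown)
print(a, b, combined)
```

With these illustrative settings the two test values do not
sum to 1, and a message made only of unknown tokens ends up
with a combined value that is neither robx nor 0.5. Changing
robx moves the result, which is why 0.5 has no fixed meaning.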

pi