Exclusion Intervals

David Relson relson at osagesoftware.com
Wed Jun 30 15:29:21 CEST 2004


On Wed, 30 Jun 2004 15:10:50 +0200
Boris 'pi' Piwinger wrote:

> David Relson <relson at osagesoftware.com> wrote:
> 
> >Step 1 of scoring a message is to score each token.  This gives
> >probability scores, which are centered around 0.5.  These values are
> >(roughly) linear, with 0.0 meaning "completely hammish", 0.5 meaning
> >"no clue", and 1.0 meaning "completely spammish".  min_dev applies to
> >these values.  
> >
> >Step 2 is to apply the Bayesian computation to these probabilities.
> >This produces another probability.  This value is also linear (in the
> >same sense as the step 1 value).
> >
> >Step 3 applies the inverse chi-square test.  This looks at the step 2
> >score and the number of tokens comprising it and computes a value
> >indicating the "certainty" with which the score represents ham or
> >spam. If I remember what little I know of statistics, this
> >"certainty" is on a bell curve.  The actual computed value ranges
> >between -1 and +1 and bogofilter normalizes it to a value between 0
> >and 1.
> 
> More exactly, we have two values here which are combined.
> Those two values come from testing whether we can refute the
> statement that the message is ham or spam, respectively. Each
> of those is highly asymmetric. Also, those values are *not*
> probability values. In particular, this is true for their
> (normalized) combination, even if it looks like one.
> Furthermore, the values depend on additional parameters
> (like robx and robs), so there is no particular statement
> associated with the value .5.

pi,

Right!  Thank you for the correction.
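
For reference, here is a rough sketch of the two-value combination as it
is usually described for the Robinson-Fisher method. It is not taken
from bogofilter's source. The function names (robinson_f, chi2q,
combine) are made up for illustration, and the smoothing formula is the
one commonly attributed to Robinson, so treat the details as
assumptions.

```python
import math

def robinson_f(p, n, robs, robx):
    """Smoothed token score (assumed Robinson form): the raw estimate p,
    seen n times, is pulled toward robx with prior weight robs."""
    return (robs * robx + n * p) / (robs + n)

def chi2q(x2, dof):
    """Upper-tail probability of chi-square for an even dof."""
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def combine(fs):
    """Combine token scores fs (each strictly between 0 and 1)."""
    n = len(fs)
    # Test 1: how strongly the tokens argue for "spam".
    spam_ind = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - f) for f in fs), 2 * n)
    # Test 2: how strongly the tokens argue for "ham".
    ham_ind = 1.0 - chi2q(-2.0 * sum(math.log(f) for f in fs), 2 * n)
    # The difference lies in [-1, +1]; normalize it to [0, 1].
    return (1.0 + (spam_ind - ham_ind)) / 2.0
```

Note that in this sketch a token that has never been seen (n = 0) gets
exactly robx as its score, which is one reason robx rather than 0.5
might be the natural "no clue" point.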

Possibly the center of the exclusion region should be the robx value
rather than 0.5; it isn't clear which is better.  Bogotune could be
modified to vary the center of the exclusion interval, and it would be
interesting to see which value it finds to be best.  All it takes is
time.
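
To make the idea concrete, here is a minimal sketch of an exclusion
interval with a movable center. The function name keep_tokens and the
center parameter are hypothetical; this only illustrates dropping
tokens whose score lies within min_dev of the chosen center, and is not
how bogofilter or bogotune is actually written.

```python
def keep_tokens(scores, min_dev, center=0.5):
    """Return the token scores that fall outside the exclusion interval
    (center - min_dev, center + min_dev)."""
    return [p for p in scores if abs(p - center) >= min_dev]

scores = [0.1, 0.45, 0.52, 0.62, 0.9]

# Interval centered at 0.5: scores 0.45 and 0.52 are ignored.
centered_at_half = keep_tokens(scores, 0.1)

# Interval centered at a hypothetical robx of 0.6: now 0.52 and 0.62
# are ignored, while 0.45 counts as (mild) evidence.
centered_at_robx = keep_tokens(scores, 0.1, center=0.6)
```

A tuning run would repeat the scoring with several candidate centers
and compare the resulting error rates.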

David


