Exclusion Intervals

David Relson relson at osagesoftware.com
Wed Jun 30 19:08:53 CEST 2004


On Wed, 30 Jun 2004 12:41:56 -0400
Tom Anderson wrote:

> From: "David Relson" <relson at osagesoftware.com>
> > Point 1:  Token scores are individual probabilities centered around
> > 0.5, aka "even odds". 
> 
> I'm not an expert on how bogofilter achieves its classification, but
> here's what I do know:
> 
> 1) Scores are biased by robx, robs for unknown or little known tokens.
>  
> 2) When I look up a single token in the database with bogoutil, a near
> equal count for spam and ham will be very, very hammy.  Eg:
> 
>     bogoutil -p wordlist.db spam
>                 spam    good    Fisher
>     spam    1999    1557  0.043814
> 
>     bogoutil -p wordlist.db Jan
>                 spam    good    Fisher
>     Jan       3646      331  0.282159
> 
> To me, "even odds" would mean that "Jan" has a probability of appearing
> in spam of 3646/3977 or about 92%, and a probability of appearing in
> ham 331/3977 or about 8%.  How this becomes 28% probability of being
> spam I don't exactly know, but it doesn't appear to be 50/50.  And if
> I choose a token which does not exist in the database at all, the
> Fisher value is computed as 0.52, which also seems odd, especially
> since my robx is 0.46.

Tom,

You're forgetting the importance of .MSG_COUNT.  Suppose you have 10
spam and 100 ham and Jan has counts of 10/10.  Since it's in 100% of the
spam and 10% of the ham, its spam score should be up around 90%.

Here's a demonstration of what I mean:

[relson at osage relson]$ rm -f wordlist.db
[relson at osage relson]$ echo .MSG_COUNT 10 100 | bogoutil -l wordlist.db
[relson at osage relson]$ echo Jan 10 10 | bogoutil -l wordlist.db
[relson at osage relson]$ bogoutil -d wordlist.db
.MSG_COUNT 10 100 20040630
Jan 10 10 20040630
[relson at osage relson]$ bogoutil -p wordlist.db Jan
                                 spam    good    Fisher
Jan                                10      10  0.908745
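
For reference, that 0.908745 can be reproduced with Robinson's smoothed
per-token probability.  Here's a minimal sketch (function name is mine;
it assumes bogofilter's default robs=0.0178 and robx=0.52, so your own
settings would shift the result slightly):

```python
def token_score(spam_hits, ham_hits, spam_msgs, ham_msgs,
                robs=0.0178, robx=0.52):
    # Per-token frequencies, normalized by the .MSG_COUNT totals --
    # this is where the 10 spam / 100 ham message counts come in.
    freq_spam = spam_hits / spam_msgs
    freq_ham = ham_hits / ham_msgs
    pw = freq_spam / (freq_spam + freq_ham)
    # Robinson's smoothing pulls rarely seen tokens toward robx.
    n = spam_hits + ham_hits
    return (robs * robx + n * pw) / (robs + n)

score = token_score(10, 10, 10, 100)  # ~0.908745, the Fisher column above
```

Even though the raw counts are equal (10/10), the token appears in 100%
of the spam but only 10% of the ham, so the score lands near 0.91.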

> 
> Seeing as this is the case, it would seem that 0.5 is not the mid
> point for token scoring.
> 
> > Point 2: Message scores are the result of a chi-square test and
> > bogofilter normalizes the result to the 0..1 interval.  
> 
> In the extreme case of having only one recognizable token, the email
> should score as the score of the single token, right?  This has
> implications for where the cutoffs can be, where robx can be, and the
> size of min_dev, as the probability of the whole email becomes
> entangled with the probability of a single token.  Now they're both
> oranges, but maintaining certain properties of apples.  Correct me if
> I'm wrong.  Expanding this to multiple tokens should logically keep
> this entanglement to some extent, perhaps becoming less so with more
> tokens.  Assuming this entanglement exists, then there is a paradox
> when tokens are centered on 0.5 and emails are centered elsewhere.

Again, the response is "No".  The Bayesian result will be approximately
the score of the token.  However the chi-square test is (as pi pointed
out) run twice (once to determine a ham number and once to determine a
spam number) and the two numbers are then averaged.  This is not the
same as the token's score.

If BF was using the Robinson algorithm (instead of Robinson-Fisher), the
numbers would be more as you expect.
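
In outline, the two-sided combining looks like this.  This is a Python
sketch of Robinson's chi-square method, not bogofilter's actual C code;
chi2Q is the standard series for the survival function with an even
number of degrees of freedom, and the variable names are mine:

```python
import math

def chi2Q(x2, v):
    # P(X > x2) for a chi-square variate with v (even) degrees of
    # freedom, computed by summing the closed-form series.
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def spamicity(probs):
    n = len(probs)
    # Run the test twice -- once against "all tokens hammy" and once
    # against "all tokens spammy" -- then average the two indicators.
    S = 1.0 - chi2Q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    H = 1.0 - chi2Q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (S - H + 1.0) / 2.0
```

With several agreeing tokens the combined score is pushed further toward
the extreme than any single token's score, which is where this departs
from a plain product of probabilities.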

> 
> > Here's a bit more detail:
> >
> > Step 1 of scoring a message is to score each token.  This gives
> > probability scores, which are centered around 0.5.  These values are
> > (roughly) linear, with 0.0 meaning "completely hammish", 0.5 meaning
> > "no clue", and 1.0 meaning "completely spammish".  min_dev applies
> > to these values.  
> >
> > Step 2 is to apply the bayesian computation to these probabilities.
> > This produces another probability.  This value is also linear (in
> > the same sense as the step 1 value).
> 
> Here are just a few tokens from the "bogofilter -vvv" output of a
> recent unsure spam I received...
> 
>                                      n    pgood     pbad      fw     U
> "Windows"                         4481  0.075187  0.015459  0.170552 +
> "Graphics"                        1834  0.029372  0.006377  0.178415 +
> "Microsoft"                       4154  0.064300  0.014523  0.184265 +
> "expensive"                        547  0.008052  0.001927  0.193231 +
> "need"                           16181  0.220685  0.057636  0.207088 +
> "head:tanderso"                   7114  0.092198  0.025512  0.216744 +
> "more"                           44110  0.561805  0.158539  0.220088 +
> "Office"                          4060  0.041733  0.014949  0.263739 +
> 
> And this is their corresponding "bogoutil -p" output:
> 
>                                  spam    good    Fisher
> Windows                          3818     663  0.170541
> Graphics                         1575     259  0.178387
> Microsoft                        3587     567  0.184253
> expensive                         476      71  0.193144
> need                            14235    1946  0.207085
> head:tanderso                    6301     813  0.216738
> more                            39156    4954  0.220087
> Office                           3692     368  0.263730
> 
> My config is as follows: robx=0.46, robs=0.2, min_dev=0.2,
> spam_cutoff=0.465, ham_cutoff=0.1
> 
> As you can see, all of these terms occur profusely in spam and
> comparably little in ham, but they are all contributing to the ham
> score!  I don't understand how this is a linear relationship centered
> around 0.5.  Am I missing something?

Yes - .MSG_COUNT
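
You can see it in your own numbers.  fw comes from pbad/(pbad+pgood),
where pgood and pbad are the counts divided by the ham and spam totals
from .MSG_COUNT.  A quick sketch backing the totals out of your
"Windows" line (the inferred message counts are estimates, ignoring
robs smoothing, which is why fw is off in the fifth decimal):

```python
# "Windows" from the bogofilter -vvv output quoted above
spam_hits, ham_hits = 3818, 663
pbad, pgood = 0.015459, 0.075187

# The token score is built from per-message frequencies, not raw counts.
fw = pbad / (pbad + pgood)   # close to the reported 0.170552

# Implied .MSG_COUNT totals: vastly more spam messages than ham
# messages, which is why "3818 vs 663 must be spammy" intuition fails.
spam_msgs = spam_hits / pbad   # roughly 247,000
ham_msgs = ham_hits / pgood    # roughly 8,800
```

With that many spam messages registered, 3818 occurrences is a *low*
per-message rate for spam, while 663 occurrences is a high rate for ham.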

> > Step 3 applies the inverse chi-square test.  This looks at the step
> > 2 score and the number of tokens comprising it and computes a value
> > indicating the "certainty" with which the score represents ham or
> > spam. If I remember what little I know of statistics, this
> > "certainty" is on a bell curve.  The actual computed value ranges
> > between -1 and +1 and bogofilter normalizes it to a value between 0
> > and 1.
> > 
> > Step 4 applies the ham_cutoff and spam_cutoff values to classify the
> > message as ham, spam, or unsure.
> 
> Here's the result of that:
> 
>  X-Bogosity: Unsure, tests=bogofilter, spamicity=0.318373,
>  version=0.17.5
>    int  cnt   prob  spamicity histogram
>   0.00    7 0.033940 0.014140 #######
>   0.10    6 0.173187 0.061171 ######
>   0.20    8 0.248424 0.127623 ########
>   0.30    0 0.000000 0.127623
>   0.40    0 0.000000 0.127623
>   0.50    0 0.000000 0.127623
>   0.60    0 0.000000 0.127623
>   0.70    2 0.727572 0.192375 ##
>   0.80    3 0.837395 0.289786 ###
>   0.90    4 0.972075 0.449355 ####
> 
> As you can see, my 0.2 min_dev is keeping all of the tokens between
> 0.3 and 0.7 from contributing to the final score.  However, those
> tokens in the 0.1-0.3 range are not very hammy (eg: professional,
> office, software, $60, etc), while the ones between 0.5 and 0.7 are
> actually quite spammy (eg: Adobe, Photoshop, etc).  This email would
> probably score appropriately if the min_dev range was centered between
> my cutoffs near 0.3.
> 
> > Both steps 1 and 4 can be considered as having "range centers" and
> > "range widths".  This similarity does not mean that these steps have
> > comparable centers or comparable widths.  I've run bogotune with a
> > variety of test corpora and looked at its recommendations.  It
> > generally recommends a spam cutoff slightly above 0.5 (often a value
> > like 0.500010) and a ham cutoff much below 0.5 (at least 0.125). 
> > This lack of symmetry for final score is a further indication of
> > apples and oranges, i.e. scoring tokens with a symmetric and
> > centered exclusion interval produces inverse chi-square results with
> > a differently sized and centered exclusion interval.
> 
> Actually, the fact that you've fixed 0.5 for tokens and not for
> emails, and that bogotune gives a non-0.5 center for emails indicates
> to me that maybe the center for tokens should be freed up too.  Think
> about it this way... the reason your cutoffs are shifted to the ham
> side is because you're finding that emails classified by bogofilter
> tend to err on the hammy side of 0.5.  Ideally, spam emails would
> score near 1.0, hams near 0.0, and unsures should center around 0.5. 
> This would be a balanced bell curve.  But if you fixed your cutoffs
> around 0.5, you would get lots of false negatives.  In order to
> compensate for this fact, you shift your cutoffs down.  But you
> wouldn't have to shift your cutoffs down if bogofilter scored emails
> clustered at the extrema with unsures at 0.5.  You could just as
> easily fix the cutoffs around 0.5 and free up the min_dev center to
> achieve the same effect as we get today.  Tokens would shift toward
> the spammy side to compensate for their innate bias toward ham, and
> emails would overall score with 0.5 being unsure.  But this would
> produce the same problem as today as well... unnecessary false
> negatives.  The solution still seems to be to free up both of these
> "range centers" to balance out inequities caused by other parameters.

Bogotune does its scans of the parameter grids (multiple possible
values) and finds the combinations that produce the best results (fewest
false positives).  It applies these parameters to the test corpora and
checks to see where the good cutoff points are.  Its results are
empirical with no attempt to balance.

> > As I've indicated, I'm willing to add (on an experimental basis) a
> > parameter for specifying the center of the exclusion interval.  At
> > the moment I've got no clue how much of a difference doing that will
> > make.  So far, however, NOBODY has responded to those suggestions.
> 
> If you could add the parameter to the config to change the min_dev
> center, but to default it to 0.5, then people could experiment with it
> without affecting any existing configurations.
> 
> > Possibly, the center of the exclusion region should be the robx
> > value. It isn't clear.  Bogotune could be modified to vary the
> > center of the exclusion interval.  It might be interesting to see
> > what it finds to be the best value.  All it takes is time.
> 
> I also thought that robx would be a good value at first.  However, the
> entire point of robx is to bias unknown or little known tokens.  If
> you use this value to establish the center-point for all tokens, then
> it loses its quality of biasing.  Robx is not an appropriate value,
> but the mid point should probably be somewhere near robx, or rather
> robx should be thought of as an offset from the mid point rather than
> from 0.5.

Note the word "possibly" in my comment.  I've learned that one can't
predict the values of the good parameter sets for bogofilter. 
Experience has shown that people are successfully using many different
values. 

I'll see about adding "excl_center" and "excl_magnitude" parameters to
create an exclusion interval from

  excl_center-excl_magnitude to excl_center+excl_magnitude
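
As a sketch of what that would do to token selection (parameter names
as proposed above; this is illustrative, not bogofilter code):

```python
def contributes(f, excl_center=0.5, excl_magnitude=0.2):
    # A token's score f is used only when it falls outside the
    # exclusion interval
    #   [excl_center - excl_magnitude, excl_center + excl_magnitude].
    # With the defaults this reproduces today's min_dev behavior
    # centered on 0.5.
    return abs(f - excl_center) >= excl_magnitude

# Today's fixed center ignores a mildly spammy 0.55 token:
contributes(0.55)                    # False
# With the center moved down to 0.3, the same token counts as spammy:
contributes(0.55, excl_center=0.3)   # True
```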

Regards,

David
