Exclusion Intervals

Tom Anderson tanderso at oac-design.com
Wed Jun 30 18:41:56 CEST 2004


From: "David Relson" <relson at osagesoftware.com>
> Point 1:  Token scores are individual probabilities centered around 0.5,
> aka "even odds". 

I'm not an expert on how bogofilter achieves its classification, but here's what I do know:

1) Scores are biased by robx and robs for unknown or little-known tokens.
2) When I look up a single token in the database with bogoutil, a near-equal count for spam and ham will be very, very hammy.  E.g.:

    bogoutil -p wordlist.db spam
                spam    good    Fisher
    spam    1999    1557  0.043814

    bogoutil -p wordlist.db Jan
                spam    good    Fisher
    Jan       3646      331  0.282159

To me, "even odds" would mean that "Jan" has a probabilty of appearing in spam of 3646/3977 or about 92%, and a probability of appearing in ham 331/3977 or about 8%.  How this becomes 28% probability of being spam I don't exactly know, but it doesn't appear to be 50/50.  And if I choose a token which does not exist in the database at all, the Fisher value is computed as 0.52, which also seems odd, especially since my robx is 0.46.

Given all of that, it would seem that 0.5 is not the midpoint for token scoring.

> Point 2: Message scores are the result of a chi-square test and
> bogofilter normalizes the result to the 0..1 interval.  

In the extreme case of having only one recognizable token, the email should score as the score of that single token, right?  This has implications for where the cutoffs can be, where robx can be, and for the size of min_dev, since the probability of the whole email becomes entangled with the probability of a single token.  Now they're both oranges, but maintaining certain properties of apples.  Correct me if I'm wrong.  Expanding this to multiple tokens should logically preserve this entanglement to some extent, perhaps becoming weaker with more tokens.  Assuming this entanglement exists, then there is a paradox when tokens are centered on 0.5 and emails are centered elsewhere.
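To check that intuition for the one-token case, here is a sketch of the Fisher-style combining as Robinson describes it (again, my reading of the papers, not bogofilter's actual code); with exactly one recognizable token the combined score collapses to that token's own score:

    import math

    def chi2q(x2, df):
        # survival function of the chi-square distribution (even df only)
        m = x2 / 2.0
        term, total = math.exp(-m), math.exp(-m)
        for i in range(1, df // 2):
            term *= m / i
            total += term
        return min(total, 1.0)

    def spamicity(fws):
        # Robinson's combined indicator: S and H come from the inverse
        # chi-square of the summed logs, final score is (1 + S - H) / 2
        n = len(fws)
        S = chi2q(-2.0 * sum(math.log(f) for f in fws), 2 * n)
        H = chi2q(-2.0 * sum(math.log(1.0 - f) for f in fws), 2 * n)
        return (1.0 + S - H) / 2.0

    # one token: chi2q(-2*ln(f), 2) == f, so the message scores exactly f
    print(spamicity([0.282159]))   # 0.282159, the "Jan" token score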

> Here's a bit more detail:
>
> Step 1 of scoring a message is to score each token.  This gives
> probability scores, which are centered around 0.5.  These values are
> (roughly) linear, with 0.0 meaning "completely hammish", 0.5 meaning "no
> clue", and 1.0 meaning "completely spammish".  min_dev applies to these
> values.  
>
> Step 2 is to apply the bayesian computation to these probabilities. This
> produces another probability.  This value is also linear (in the same
> sense as the step 1 value).

Here are just a few tokens from the "bogofilter -vvv" output of a recent unsure spam I received...

                                     n    pgood     pbad      fw     U
"Windows"                         4481  0.075187  0.015459  0.170552 +
"Graphics"                        1834  0.029372  0.006377  0.178415 +
"Microsoft"                       4154  0.064300  0.014523  0.184265 +
"expensive"                        547  0.008052  0.001927  0.193231 +
"need"                           16181  0.220685  0.057636  0.207088 +
"head:tanderso"                   7114  0.092198  0.025512  0.216744 +
"more"                           44110  0.561805  0.158539  0.220088 +
"Office"                          4060  0.041733  0.014949  0.263739 +

And this is their corresponding "bogoutil -p" output:

                                 spam    good    Fisher
Windows                          3818     663  0.170541
Graphics                         1575     259  0.178387
Microsoft                        3587     567  0.184253
expensive                         476      71  0.193144
need                            14235    1946  0.207085
head:tanderso                    6301     813  0.216738
more                            39156    4954  0.220087
Office                           3692     368  0.263730

My config is as follows: robx=0.46, robs=0.2, min_dev=0.2, spam_cutoff=0.465, ham_cutoff=0.1

As you can see, all of these terms occur profusely in spam and comparatively little in ham, yet they are all contributing to the ham score!  I don't understand how this is a linear relationship centered around 0.5.  Am I missing something?
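In fact, the fw column above looks like it is just pbad/(pbad + pgood), plus a tiny bit of smoothing, which would make the center of token scoring a function of the relative sizes of the two lists rather than a fixed 0.5; that's my reading of the numbers, not something I've verified in the source:

    # "Windows" and "more" rows from the -vvv output above:
    for pgood, pbad in ((0.075187, 0.015459), (0.561805, 0.158539)):
        print(pbad / (pbad + pgood))   # ~0.1705 and ~0.2201, matching fw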

> Step 3 applies the inverse chi-square test.  This looks at the step 2
> score and the number of tokens comprising it and computes a value
> indicating the "certainty" with which the score represents ham or spam.
> If I remember what little I know of statistics, this "certainty" is on a
> bell curve.  The actual computed value ranges between -1 and +1 and
> bogofilter normalizes it to a value between 0 and 1.
> 
> Step 4 applies the ham_cutoff and spam_cutoff values to classify the
> message as ham, spam, or unsure.

Here's the result of that:

 X-Bogosity: Unsure, tests=bogofilter, spamicity=0.318373, version=0.17.5
   int  cnt   prob  spamicity histogram
  0.00    7 0.033940 0.014140 #######
  0.10    6 0.173187 0.061171 ######
  0.20    8 0.248424 0.127623 ########
  0.30    0 0.000000 0.127623
  0.40    0 0.000000 0.127623
  0.50    0 0.000000 0.127623
  0.60    0 0.000000 0.127623
  0.70    2 0.727572 0.192375 ##
  0.80    3 0.837395 0.289786 ###
  0.90    4 0.972075 0.449355 ####

As you can see, my 0.2 min_dev is keeping all of the tokens between 0.3 and 0.7 from contributing to the final score.  However, the tokens in the 0.1-0.3 range are not very hammy (e.g. professional, office, software, $60, etc.), while the ones between 0.5 and 0.7 are actually quite spammy (e.g. Adobe, Photoshop, etc.).  This email would probably score appropriately if the min_dev range were centered between my cutoffs, near 0.3.
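To make the suggestion concrete, the exclusion test I have in mind is something like this, with the fixed 0.5 center turned into a parameter (the min_dev_center name is made up; no such option exists today):

    def excluded(fw, min_dev=0.2, min_dev_center=0.5):
        # a token inside the exclusion interval is ignored when scoring
        return abs(fw - min_dev_center) < min_dev

    # with the current center of 0.5 the spammy 0.5-0.7 tokens (Adobe,
    # Photoshop, ...) are thrown away; recentering near 0.3 keeps them
    # and excludes the not-very-hammy 0.1-0.3 crowd instead:
    for fw in (0.19, 0.28, 0.65):
        print(fw, excluded(fw), excluded(fw, min_dev_center=0.3))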

> Both steps 1 and 4 can be considered as having "range centers" and
> "range widths".  This similarity does not mean that these steps have
> comparable centers or comparable widths.  I've run bogotune with a
> variety of test corpora and looked at its recommendations.  It generally
> recommends a spam cutoff slightly above 0.5 (often a value like
> 0.500010) and a ham cutoff much below 0.5 (at least 0.125).  This lack
> of symmetry for final score is a further indication of apples and
> oranges, i.e. scoring tokens with a symmetric and centered exclusion
> interval produces inverse chi-square results with a differently sized
> and centered exclusion interval.

Actually, the fact that you've fixed 0.5 for tokens and not for emails, and that bogotune gives a non-0.5 center for emails, indicates to me that maybe the center for tokens should be freed up too.

Think about it this way... the reason your cutoffs are shifted to the ham side is that you're finding emails classified by bogofilter tend to err on the hammy side of 0.5.  Ideally, spam emails would score near 1.0, hams near 0.0, and unsures would center around 0.5.  This would be a balanced bell curve.  But if you fixed your cutoffs around 0.5, you would get lots of false negatives.  To compensate for this, you shift your cutoffs down.  But you wouldn't have to shift your cutoffs down if bogofilter scored emails clustered at the extrema with unsures at 0.5.  You could just as easily fix the cutoffs around 0.5 and free up the min_dev center to achieve the same effect as we get today.  Tokens would shift toward the spammy side to compensate for their innate bias toward ham, and emails would overall score with 0.5 being unsure.  But this would produce the same problem as today as well... unnecessary false negatives.

The solution still seems to be to free up both of these "range centers" to balance out inequities caused by other parameters.

> As I've indicated, I'm willing to add (on an experimental basis) a
> parameter for specifying the center of the exclusion interval.  At the
> moment I've got no clue how much of a difference doing that will
> make.  So far, however, NOBODY has responded to those suggestions.

If you could add a parameter to the config to change the min_dev center, but default it to 0.5, then people could experiment with it without affecting any existing configurations.
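In bogofilter.cf terms, I'm imagining something along these lines (again, the parameter name is invented for illustration):

    # hypothetical; this option does not exist in bogofilter today
    min_dev=0.2
    min_dev_center=0.5    # default, same as current behavior; lower it to experiment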

> Possibly, the center of the exclusion region should be the robx value.
> It isn't clear.  Bogotune could be modified to vary the center of the
> exclusion interval.  It might be interesting to see what it finds to be
> the best value.  All it takes is time.

I also thought at first that robx would be a good value.  However, the entire point of robx is to bias unknown or little-known tokens.  If you use this value to establish the center point for all tokens, then it loses its quality of biasing.  Robx is not an appropriate value, but the midpoint should probably be somewhere near robx; or rather, robx should be thought of as an offset from the midpoint rather than from 0.5.

Tom