<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">

<HTML><HEAD>

<META http-equiv=Content-Type content="text/html; charset=iso-8859-1">


</HEAD>

<BODY>

<DIV><FONT face=Arial size=2>From: "David Relson" <</FONT><A 

href="mailto:relson@osagesoftware.com"><FONT face=Arial 

size=2>relson@osagesoftware.com</FONT></A><FONT face=Arial 

size=2>></FONT></DIV>

<DIV><FONT face=Arial size=2>> Point 1:  Token scores are individual 

probabilities centered around 0.5,<BR>> aka "even 

odds". <BR></FONT></DIV>

<DIV><FONT face=Arial size=2>I'm not an expert on how bogofilter achieves its 

classification, but here's what I do know:</FONT></DIV>

<DIV><FONT face=Arial size=2></FONT> </DIV>

<DIV><FONT face=Arial size=2>1) Scores are biased by robx, robs for unknown 

or little known tokens.  </FONT></DIV>

<DIV><FONT face=Arial size=2>2) When I look up a single token in the database 

with bogoutil, a token with a nearly equal count for spam and ham scores as very, very 

hammy.  E.g.:</FONT></DIV>

<DIV><FONT face=Arial size=2></FONT> </DIV>

<DIV><FONT face=Arial size=2>    bogoutil -p wordlist.db 

spam</FONT></DIV>

<DIV>

<DIV><FONT face=Arial size=2>        

        spam    

good    Fisher</FONT></DIV>

<DIV><FONT face=Arial 

size=2>    spam    1999    1557  0.043814</FONT></DIV>

<DIV><FONT face=Arial size=2></FONT> </DIV></DIV>

<DIV><FONT face=Arial size=2>    bogoutil -p wordlist.db 

Jan</FONT></DIV>

<DIV><FONT face=Arial size=2>        

        spam    

good    Fisher</FONT></DIV>

<DIV><FONT face=Arial size=2>    

Jan       3646      

331  0.282159</FONT></DIV>

<DIV><FONT face=Arial size=2></FONT> </DIV>

<DIV><FONT face=Arial size=2>To me, "even odds" would mean that "Jan" has a 

probability of appearing in spam of 3646/3977, or about 92%, and a probability of 

appearing in ham 331/3977 or about 8%.  How this becomes 28% probability of 

being spam I don't exactly know, but it doesn't appear to be 50/50.  And if 

I choose a token which does not exist in the database at all, the Fisher value 

is computed as 0.52, which also seems odd, especially since my robx is 

0.46.</FONT></DIV>

<DIV><FONT face=Arial size=2></FONT> </DIV>

<DIV><FONT face=Arial size=2>Seeing as this is the case, it would seem that 0.5 

is not the mid point for token scoring.</FONT></DIV>

<DIV><FONT face=Arial size=2></FONT> </DIV>

<DIV><FONT face=Arial size=2>> Point 2: Message scores are the result of a 

chi-square test and<BR>> bogofilter normalizes the result to the 0..1 

interval.  <BR></FONT></DIV>

<DIV><FONT face=Arial size=2>In the extreme case of having only one 

recognizable token, the email's score should equal the score of that single 

token, right?  This has implications for where the cutoffs can be, where 

robx can be, and the size of min_dev, as the probability of the whole email 

becomes entangled with the probability of a single token.  Now they're both 

oranges, but maintaining certain properties of apples.  Correct me if I'm 

wrong.  Expanding this to multiple tokens should logically keep this 

entanglement to some extent, perhaps becoming less so with more tokens.  

Assuming this entanglement exists, then there is a paradox when tokens are 

centered on 0.5 and emails are centered elsewhere.</FONT></DIV>
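<DIV><FONT face=Arial size=2>The single-token case can be checked numerically. The following is a minimal sketch, assuming the message score is combined the way Robinson describes the Fisher / inverse chi-square method; bogofilter's actual code may differ in details. The function names chi2q and message_score are made up for illustration.</FONT></DIV>

```python
import math

def chi2q(x, df):
    """Chi-square survival function for an even number of degrees of freedom."""
    m = x / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def message_score(fws):
    """Combine per-token scores into one message score (Robinson-Fisher sketch)."""
    n = len(fws)
    # h is small when the token scores are near 0, s is small when they are near 1
    h = chi2q(-2.0 * sum(math.log(f) for f in fws), 2 * n)
    s = chi2q(-2.0 * sum(math.log(1.0 - f) for f in fws), 2 * n)
    return (1.0 + h - s) / 2.0

# One token only: under these assumptions the message score equals the token score
print(message_score([0.170552]))

# Two tokens equally far from 0.5 on opposite sides cancel out
print(message_score([0.2, 0.8]))
```

<DIV><FONT face=Arial size=2>With two degrees of freedom the chi-square tail is simply exp(-x/2), so for a single token the two terms reduce to f and 1-f and the combined score is f itself. Under that assumption, the token scale and the message scale coincide in the one-token case.</FONT></DIV>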

<DIV><BR><FONT face=Arial size=2>> Here's a bit more 

detail:<BR>></FONT></DIV>

<DIV><FONT face=Arial size=2>> Step 1 of scoring a message is to score each 

token.  This gives<BR>> probability scores, which are centered around 

0.5.  These values are<BR>> (roughly) linear, with 0.0 meaning 

"completely hammish", 0.5 meaning "no<BR>> clue", and 1.0 meaning "completely 

spammish".  min_dev applies to these<BR>> values.  </FONT></DIV>

<DIV><FONT face=Arial size=2>><BR>> Step 2 is to apply the bayesian 

computation to these probabilities. This<BR>> produces another 

probability.  This value is also linear (in the same<BR>> sense as the 

step 1 value).</FONT></DIV>

<DIV><FONT face=Arial size=2></FONT> </DIV>

<DIV><FONT face=Arial size=2>Here are just a few tokens from 

the "bogofilter -vvv" output of a recent unsure spam I 

received...</FONT></DIV><FONT face=Arial size=2></FONT>

<DIV><FONT face=Arial size=2></FONT><BR><FONT face="Courier New" 

size=2>                                     

n    pgood     

pbad      fw     U</FONT></DIV>

<DIV><FONT face="Courier New" 

size=2>"Windows"                         

4481  0.075187  0.015459  0.170552 

+<BR>"Graphics"                        

1834  0.029372  0.006377  0.178415 

+<BR>"Microsoft"                       

4154  0.064300  0.014523  0.184265 

+<BR>"expensive"                        

547  0.008052  0.001927  0.193231 

+<BR>"need"                           

16181  0.220685  0.057636  0.207088 

+<BR>"head:tanderso"                   

7114  0.092198  0.025512  0.216744 

+<BR>"more"                           

44110  0.561805  0.158539  0.220088 

+<BR>"Office"                          

4060  0.041733  0.014949  0.263739 +<BR></DIV></FONT>

<DIV><FONT face=Arial size=2>And this is their corresponding "bogoutil -p" 

output:</FONT></DIV>

<DIV><FONT face=Arial size=2></FONT> </DIV>

<DIV><FONT face="Courier New" 

size=2>                                 

spam    good    

Fisher<BR>Windows                          

3818     663  

0.170541<BR>Graphics                         

1575     259  

0.178387<BR>Microsoft                        

3587     567  

0.184253<BR>expensive                         

476      71  

0.193144<BR>need                            

14235    1946  

0.207085<BR>head:tanderso                    

6301     813  

0.216738<BR>more                            

39156    4954  

0.220087<BR>Office                           

3692     368  0.263730</FONT></DIV>

<DIV><FONT face=Arial size=2></FONT><FONT face=Arial size=2></FONT> </DIV>

<DIV><FONT face=Arial size=2>

<DIV><FONT face=Arial size=2>My config is as follows: robx=0.46, robs=0.2, 

min_dev=0.2, spam_cutoff=0.465, ham_cutoff=0.1</FONT></DIV></FONT></DIV>

<DIV><FONT face=Arial size=2></FONT> </DIV>

<DIV><FONT face=Arial size=2>As you can see, all of these terms occur profusely 

in spam and comparably little in ham, but they are all contributing to the ham 

score!  I don't understand how this is a linear relationship centered 

around 0.5.  Am I missing something?</FONT></DIV><FONT face=Arial 

size=2></FONT><FONT face=Arial size=2></FONT>
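<DIV><FONT face=Arial size=2>For what it's worth, the fw column can be reproduced from pgood and pbad. The following is a minimal sketch, assuming bogofilter computes token scores with Robinson's f(w) formula, where robs and robx are the configured values and n is the token's total count; the function name token_score is made up for illustration. Note that in the rows above pgood is larger than pbad even though the raw spam count is larger than the raw good count, which suggests (though the output doesn't say so directly) that the counts are divided by the number of ham and spam messages in the wordlist before being compared.</FONT></DIV>

```python
def token_score(pgood, pbad, n, robs=0.2, robx=0.46):
    """Robinson's f(w): pull the raw ratio p(w) toward robx with strength robs."""
    p = pbad / (pbad + pgood)  # raw spamminess from the per-message rates
    return (robs * robx + n * p) / (robs + n)

# "Windows" row from the -vvv output: n=4481, pgood=0.075187, pbad=0.015459
fw = token_score(pgood=0.075187, pbad=0.015459, n=4481)
print(round(fw, 4))
```

<DIV><FONT face=Arial size=2>For the "Windows" row this comes out within rounding of the reported 0.170552. Because n is in the thousands, the robs/robx term barely moves the result; the score is driven almost entirely by the ratio of pbad to pgood.</FONT></DIV>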

<DIV><BR><FONT face=Arial size=2>> Step 3 applies the inverse chi-square 

test.  This looks at the step 2<BR>> score and the number of tokens 

comprising it and computes a value<BR>> indicating the "certainty" with which 

the score represents ham or spam.<BR>> If I remember what little I know of 

statistics, this "certainty" is on a<BR>> bell curve.  The actual 

computed value ranges between -1 and +1 and<BR>> bogofilter normalizes it to 

a value between 0 and 1.<BR>> <BR>> Step 4 applies the ham_cutoff and 

spam_cutoff values to classify the<BR>> message as ham, spam, or 

unsure.<BR></FONT></DIV>

<DIV><FONT face=Arial size=2>Here's the result of that:</FONT></DIV>

<DIV><FONT face=Arial size=2> </DIV></FONT>

<DIV><FONT face=Arial size=2><FONT face="Courier New"> X-Bogosity: Unsure, 

tests=bogofilter, spamicity=0.318373, version=0.17.5<BR>   int  

cnt   prob  spamicity histogram<BR>  0.00    

7 0.033940 0.014140 #######<BR>  0.10    6 0.173187 0.061171 

######<BR>  0.20    8 0.248424 0.127623 ########<BR>  

0.30    0 0.000000 0.127623<BR>  0.40    0 

0.000000 0.127623<BR>  0.50    0 0.000000 0.127623<BR>  

0.60    0 0.000000 0.127623<BR>  0.70    2 

0.727572 0.192375 ##<BR>  0.80    3 0.837395 0.289786 

###<BR>  0.90    4 0.972075 0.449355 

####</FONT><BR></FONT></DIV>

<DIV><FONT face=Arial size=2>As you can see, my 0.2 min_dev is keeping all of 

the tokens between 0.3 and 0.7 from contributing to the final score.  

However, those tokens in the 0.1-0.3 range are not very hammy (e.g. professional, 

office, software, $60, etc.), while the ones between 0.5 and 0.7 are actually 

quite spammy (e.g. Adobe, Photoshop, etc.).  This email would probably score 

appropriately if the min_dev range were centered between my cutoffs, near 

0.3.</FONT><FONT face=Arial size=2></DIV>

<DIV><BR>> Both steps 1 and 4 can be considered as having "range centers" 

and<BR>> "range widths".  This similarity does not mean that these steps 

have<BR>> comparable centers or comparable widths.  I've run bogotune 

with a<BR>> variety of test corpora and looked at its recommendations.  

It generally<BR>> recommends a spam cutoff slightly above 0.5 (often a value 

like<BR>> 0.500010) and a ham cutoff much below 0.5 (at least 0.125).  

This lack<BR>> of symmetry for final score is a further indication of apples 

and<BR>> oranges, i.e. scoring tokens with a symmetric and centered 

exclusion<BR>> interval produces inverse chi-square results with a 

differently sized<BR>> and centered exclusion interval.<BR></DIV>

<DIV>Actually, the fact that you've fixed 0.5 for tokens and not for emails, and 

that bogotune gives a non-0.5 center for emails indicates to me that maybe 

the center for tokens should be freed up too.  Think about it this 

way... the reason your cutoffs are shifted to the ham side is because you're 

finding that emails classified by bogofilter tend to err on the hammy side of 

0.5.  Ideally, spam emails would score near 1.0, hams near 0.0, and unsures 

should center around 0.5.  This would be a balanced bell curve.  But 

if you fixed your cutoffs around 0.5, you would get lots of false 

negatives.  In order to compensate for this fact, you shift your cutoffs 

down.  But you wouldn't have to shift your cutoffs down if bogofilter 

scored emails clustered at the extrema with unsures at 0.5.  You could just 

as easily fix the cutoffs around 0.5 and free up the min_dev center to achieve 

the same effect as we get today.  Tokens would shift toward the spammy side 

to compensate for their innate bias toward ham, and emails would overall score 

with 0.5 being unsure.  But this would produce the same problem as today as 

well... unnecessary false negatives.  The solution still seems to be to 

free up both of these "range centers" to balance out inequities caused by other 

parameters.</DIV>

<DIV> </DIV>

<DIV>> As I've indicated, I'm willing to add (on an experimental basis) 

a<BR>> parameter for specifying the center of the exclusion interval.  

At the<BR>> moment I've got no clue how much of a difference doing that 

will<BR>> make.  So far, however, NOBODY has responded to those 

suggestions.<BR></DIV>

<DIV>If you could add the parameter to the config to change the min_dev center, 

and default it to 0.5, then people could experiment with it without affecting 

any existing configurations.</DIV>
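<DIV>To make the proposal concrete, here is a minimal sketch of an exclusion interval with a movable center. The center parameter is hypothetical (it does not exist in bogofilter today), the function name is made up, and the token scores are invented for illustration. Whether bogofilter treats the boundary as inclusive is also an assumption here.</DIV>

```python
def significant_tokens(scores, min_dev=0.2, center=0.5):
    """Keep only the token scores at least min_dev away from center."""
    return [f for f in scores if abs(f - center) >= min_dev]

# Invented token scores, not taken from a real message
scores = [0.17, 0.26, 0.45, 0.62, 0.84]

# Default center of 0.5: existing behaviour, scores from 0.3 to 0.7 are ignored
print(significant_tokens(scores))

# Center moved to 0.3: scores from 0.1 to 0.5 are ignored instead
print(significant_tokens(scores, center=0.3))
```

<DIV>With the default, nothing changes for existing configurations. With the center at 0.3, the weakly hammy scores drop out and the score at 0.62 starts to count.</DIV>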

<DIV> </DIV>

<DIV>> Possibly, the center of the exclusion region should be the robx 

value.<BR>> It isn't clear.  Bogotune could be modified to vary the 

center of the<BR>> exclusion interval.  It might be interesting to see 

what it finds to be<BR>> the best value.  All it takes is 

time.<BR></DIV>

<DIV>I also thought that robx would be a good value at first.  However, the 

entire point of robx is to bias unknown or little known tokens.  If you use 

this value to establish the center-point for all tokens, then it loses its 

quality of biasing.  Robx is not an appropriate value, but the mid point 

should probably be somewhere near robx, or rather robx should be thought of as 

an offset from the mid point rather than from 0.5.</DIV>

<DIV> </DIV>

<DIV>Tom</DIV>

<DIV> </DIV></FONT></BODY></HTML>