<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META http-equiv=Content-Type content="text/html; charset=iso-8859-1">
</HEAD>
<BODY>
<DIV><FONT face=Arial size=2>From: "David Relson" <</FONT><A
href="mailto:relson@osagesoftware.com"><FONT face=Arial
size=2>relson@osagesoftware.com</FONT></A><FONT face=Arial
size=2>></FONT></DIV>
<DIV><FONT face=Arial size=2>> Point 1: Token scores are individual
probabilities centered around 0.5,<BR>> aka "even
odds". <BR></FONT></DIV>
<DIV><FONT face=Arial size=2>I'm not an expert on how bogofilter achieves its 
classification, but here's what I do know:</FONT></DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2>1) Scores are biased by robx and robs for unknown 
or little-known tokens. </FONT></DIV>
<DIV><FONT face=Arial size=2>2) When I look up a single token in the database 
with bogoutil, a token with nearly equal spam and ham counts scores as very, very 
hammy.  E.g.:</FONT></DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2> bogoutil -p wordlist.db
spam</FONT></DIV>
<DIV>
<DIV><FONT face=Arial size=2>
spam
good Fisher</FONT></DIV>
<DIV><FONT face=Arial
size=2> spam 1999 1557 0.043814</FONT></DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV></DIV>
<DIV><FONT face=Arial size=2> bogoutil -p wordlist.db
Jan</FONT></DIV>
<DIV><FONT face=Arial size=2>
spam
good Fisher</FONT></DIV>
<DIV><FONT face=Arial size=2>
Jan 3646
331 0.282159</FONT></DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2>To me, "even odds" would mean that "Jan" has a 
probability of appearing in spam of 3646/3977, or about 92%, and a probability of 
appearing in ham of 331/3977, or about 8%.  How this becomes a 28% probability of 
being spam I don't exactly know, but it doesn't appear to be 50/50.  And if 
I choose a token which does not exist in the database at all, the Fisher value 
is computed as 0.52, which also seems odd, especially since my robx is 
0.46.</FONT></DIV>
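<DIV><FONT face=Arial size=2>For what it's worth, the 0.28 can be reproduced if 
the counts are first divided by the number of spam and ham messages trained, as in 
Gary Robinson's f(w) formula, which I believe bogofilter follows.  The sketch 
below is not bogofilter's code, and the names are mine.  The message totals are 
an assumption: they are not stated anywhere in this message and are back-computed 
from the pgood and pbad columns of the "bogofilter -vvv" output quoted further 
down.</FONT></DIV>
<PRE>
```python
ROBX = 0.46   # robx from my config
ROBS = 0.2    # robs from my config

# ASSUMPTION: message totals back-computed from the pgood/pbad columns
# of the "bogofilter -vvv" output; they are not reported directly.
SPAM_MSGS = 246980
GOOD_MSGS = 8818


def token_score(spam_count, good_count, spam_msgs=SPAM_MSGS,
                good_msgs=GOOD_MSGS, robx=ROBX, robs=ROBS):
    """Robinson-style f(w) computed from raw wordlist counts."""
    n = spam_count + good_count
    if n == 0:
        return robx                      # unknown token: only the prior is left
    pbad = spam_count / spam_msgs        # spam count relative to spam messages
    pgood = good_count / good_msgs       # ham count relative to ham messages
    p = pbad / (pbad + pgood)            # normalized for corpus size
    return (robs * robx + n * p) / (robs + n)


raw_ratio = 3646 / (3646 + 331)          # the "92%" from the paragraph above
jan = token_score(3646, 331)
spam_word = token_score(1999, 1557)
print(f"raw ratio for Jan: {raw_ratio:.3f}")
print(f"f(w) for Jan:      {jan:.3f}")
print(f"f(w) for spam:     {spam_word:.3f}")
```
</PRE>
<DIV><FONT face=Arial size=2>Under those assumed totals this gives roughly 0.28 
for "Jan" and roughly 0.04 for "spam", in line with the bogoutil values.  It 
would return robx (0.46) for a token with no counts, so it does not explain 
the 0.52 I see for tokens missing from the database.</FONT></DIV>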
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2>Seeing as this is the case, it would seem that 0.5
is not the mid point for token scoring.</FONT></DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2>> Point 2: Message scores are the result of a
chi-square test and<BR>> bogofilter normalizes the result to the 0..1
interval. <BR></FONT></DIV>
<DIV><FONT face=Arial size=2>In the extreme case of having only one 
recognizable token, the email should score as the score of the single 
token, right?  This has implications for where the cutoffs can be, where 
robx can be, and the size of min_dev, as the probability of the whole email 
becomes entangled with the probability of a single token.  Now they're both 
oranges, but maintaining certain properties of apples.  Correct me if I'm 
wrong.  Expanding this to multiple tokens should logically keep this 
entanglement to some extent, perhaps becoming less so with more tokens.  
Assuming this entanglement exists, there is a paradox when tokens are 
centered on 0.5 and emails are centered elsewhere.</FONT></DIV>
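<DIV><FONT face=Arial size=2>The single-token case can be checked directly, 
assuming the message score follows the chi-square combination Gary Robinson 
describes (Fisher's method applied to the token scores and to their 
complements).  This is a sketch of that published formula with names of my own, 
not bogofilter's source, and it ignores min_dev.</FONT></DIV>
<PRE>
```python
import math


def chi2q(x2, df):
    """Upper-tail chi-square probability; valid for even df only."""
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def message_score(token_scores):
    """Combine token scores f(w) into one value between 0 and 1."""
    n = len(token_scores)
    # Small when the scores are jointly close to 0 (hammy evidence).
    q_f = chi2q(-2.0 * sum(math.log(f) for f in token_scores), 2 * n)
    # Small when the scores are jointly close to 1 (spammy evidence).
    q_1mf = chi2q(-2.0 * sum(math.log(1.0 - f) for f in token_scores), 2 * n)
    return (1.0 + q_f - q_1mf) / 2.0


one_token = message_score([0.17])
eight_tokens = message_score([0.17] * 8)
print(f"one token at 0.17:    {one_token:.4f}")
print(f"eight tokens at 0.17: {eight_tokens:.4f}")
```
</PRE>
<DIV><FONT face=Arial size=2>With a single token the two chi-square terms reduce 
to f and 1 - f, so the message score equals the token score.  With several tokens 
that agree, the score is pushed further toward the extreme than any one of 
them.</FONT></DIV>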
<DIV><BR><FONT face=Arial size=2>> Here's a bit more
detail:<BR>></FONT></DIV>
<DIV><FONT face=Arial size=2>> Step 1 of scoring a message is to score each
token. This gives<BR>> probability scores, which are centered around
0.5.  These values are<BR>> (roughly) linear, with 0.0 meaning 
"completely hammish", 0.5 meaning "no<BR>> clue", and 1.0 meaning "completely 
spammish".  min_dev applies to these<BR>> values. </FONT></DIV>
<DIV><FONT face=Arial size=2>><BR>> Step 2 is to apply the bayesian
computation to these probabilities. This<BR>> produces another
probability. This value is also linear (in the same<BR>> sense as the
step 1 value).</FONT></DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2>Here are just a few tokens from
the "bogofilter -vvv" output of a recent unsure spam I
received...</FONT></DIV><FONT face=Arial size=2></FONT>
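<DIV><FONT face=Arial size=2></FONT><BR><FONT face="Courier New" 
size=2>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;pgood&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;pbad&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;fw&nbsp;&nbsp;U</FONT></DIV>
<DIV><FONT face="Courier New" 
size=2>"Windows"&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;4481&nbsp;&nbsp;0.075187&nbsp;&nbsp;0.015459&nbsp;&nbsp;0.170552&nbsp;&nbsp;+<BR>
"Graphics"&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;1834&nbsp;&nbsp;0.029372&nbsp;&nbsp;0.006377&nbsp;&nbsp;0.178415&nbsp;&nbsp;+<BR>
"Microsoft"&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;4154&nbsp;&nbsp;0.064300&nbsp;&nbsp;0.014523&nbsp;&nbsp;0.184265&nbsp;&nbsp;+<BR>
"expensive"&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;547&nbsp;&nbsp;0.008052&nbsp;&nbsp;0.001927&nbsp;&nbsp;0.193231&nbsp;&nbsp;+<BR>
"need"&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;16181&nbsp;&nbsp;0.220685&nbsp;&nbsp;0.057636&nbsp;&nbsp;0.207088&nbsp;&nbsp;+<BR>
"head:tanderso"&nbsp;&nbsp;7114&nbsp;&nbsp;0.092198&nbsp;&nbsp;0.025512&nbsp;&nbsp;0.216744&nbsp;&nbsp;+<BR>
"more"&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;44110&nbsp;&nbsp;0.561805&nbsp;&nbsp;0.158539&nbsp;&nbsp;0.220088&nbsp;&nbsp;+<BR>
"Office"&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;4060&nbsp;&nbsp;0.041733&nbsp;&nbsp;0.014949&nbsp;&nbsp;0.263739&nbsp;&nbsp;+<BR></FONT></DIV>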
<DIV><FONT face=Arial size=2>And this is their corresponding "bogoutil -p"
output:</FONT></DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
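<DIV><FONT face="Courier New" 
size=2>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;spam&nbsp;&nbsp;&nbsp;good&nbsp;&nbsp;&nbsp;&nbsp;Fisher<BR>
Windows&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;3818&nbsp;&nbsp;&nbsp;&nbsp;663&nbsp;&nbsp;0.170541<BR>
Graphics&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;1575&nbsp;&nbsp;&nbsp;&nbsp;259&nbsp;&nbsp;0.178387<BR>
Microsoft&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;3587&nbsp;&nbsp;&nbsp;&nbsp;567&nbsp;&nbsp;0.184253<BR>
expensive&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;476&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;71&nbsp;&nbsp;0.193144<BR>
need&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;14235&nbsp;&nbsp;&nbsp;1946&nbsp;&nbsp;0.207085<BR>
head:tanderso&nbsp;&nbsp;&nbsp;6301&nbsp;&nbsp;&nbsp;&nbsp;813&nbsp;&nbsp;0.216738<BR>
more&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;39156&nbsp;&nbsp;&nbsp;4954&nbsp;&nbsp;0.220087<BR>
Office&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;3692&nbsp;&nbsp;&nbsp;&nbsp;368&nbsp;&nbsp;0.263730</FONT></DIV>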
<DIV><FONT face=Arial size=2></FONT><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2>
<DIV><FONT face=Arial size=2>My config is as follows: robx=0.46, robs=0.2,
min_dev=0.2, spam_cutoff=0.465, ham_cutoff=0.1</FONT></DIV></FONT></DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2>As you can see, all of these terms occur profusely 
in spam and comparatively little in ham, but they are all contributing to the ham 
score! I don't understand how this is a linear relationship centered
around 0.5. Am I missing something?</FONT></DIV><FONT face=Arial
size=2></FONT><FONT face=Arial size=2></FONT>
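<DIV><FONT face=Arial size=2>One thing the two listings above do show: if I 
divide each bogoutil count by the matching pgood or pbad value, every token 
implies the same message totals.  A quick sketch of that arithmetic follows.  The 
numbers are copied from the two listings; reading pgood and pbad as "count 
divided by number of messages" is my assumption, not something I have confirmed in 
the source.</FONT></DIV>
<PRE>
```python
# (token, spam count, good count, pgood, pbad)
# Counts come from "bogoutil -p"; pgood and pbad from "bogofilter -vvv".
ROWS = [
    ("Windows",       3818,  663, 0.075187, 0.015459),
    ("Graphics",      1575,  259, 0.029372, 0.006377),
    ("Microsoft",     3587,  567, 0.064300, 0.014523),
    ("expensive",      476,   71, 0.008052, 0.001927),
    ("need",         14235, 1946, 0.220685, 0.057636),
    ("head:tanderso", 6301,  813, 0.092198, 0.025512),
    ("more",         39156, 4954, 0.561805, 0.158539),
    ("Office",        3692,  368, 0.041733, 0.014949),
]

# ASSUMPTION: pgood = good / ham messages and pbad = spam / spam messages,
# so dividing a count by its proportion recovers the implied message total.
implied = [(good / pgood, spam / pbad) for _, spam, good, pgood, pbad in ROWS]

for (token, spam, good, pgood, pbad), (ham_msgs, spam_msgs) in zip(ROWS, implied):
    p = pbad / (pbad + pgood)   # spam probability after normalizing
    print(f"{token:15s} ham msgs ~{ham_msgs:7.0f}  "
          f"spam msgs ~{spam_msgs:7.0f}  p(w) = {p:.4f}")
```
</PRE>
<DIV><FONT face=Arial size=2>Every row implies roughly 8,818 ham messages and 
about 247,000 spam messages.  If that reading is right, a token would have to 
appear in spam about 28 times as often as in ham just to reach 0.5, which these 
eight tokens do not.</FONT></DIV>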
<DIV><BR><FONT face=Arial size=2>> Step 3 applies the inverse chi-square
test. This looks at the step 2<BR>> score and the number of tokens
comprising it and computes a value<BR>> indicating the "certainty" with which
the score represents ham or spam.<BR>> If I remember what little I know of
statistics, this "certainty" is on a<BR>> bell curve.  The actual 
computed value ranges between -1 and +1 and<BR>> bogofilter normalizes it to
a value between 0 and 1.<BR>> <BR>> Step 4 applies the ham_cutoff and
spam_cutoff values to classify the<BR>> message as ham, spam, or
unsure.<BR></FONT></DIV>
<DIV><FONT face=Arial size=2>Here's the result of that:</FONT></DIV>
<DIV><FONT face=Arial size=2> </DIV></FONT>
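<DIV><FONT face=Arial size=2><FONT face="Courier New">X-Bogosity: Unsure, tests=bogofilter, spamicity=0.318373, version=0.17.5<BR>
&nbsp;int&nbsp;&nbsp;cnt&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;prob&nbsp;spamicity&nbsp;&nbsp;histogram<BR>
0.00&nbsp;&nbsp;&nbsp;&nbsp;7&nbsp;&nbsp;0.033940&nbsp;&nbsp;0.014140&nbsp;&nbsp;#######<BR>
0.10&nbsp;&nbsp;&nbsp;&nbsp;6&nbsp;&nbsp;0.173187&nbsp;&nbsp;0.061171&nbsp;&nbsp;######<BR>
0.20&nbsp;&nbsp;&nbsp;&nbsp;8&nbsp;&nbsp;0.248424&nbsp;&nbsp;0.127623&nbsp;&nbsp;########<BR>
0.30&nbsp;&nbsp;&nbsp;&nbsp;0&nbsp;&nbsp;0.000000&nbsp;&nbsp;0.127623<BR>
0.40&nbsp;&nbsp;&nbsp;&nbsp;0&nbsp;&nbsp;0.000000&nbsp;&nbsp;0.127623<BR>
0.50&nbsp;&nbsp;&nbsp;&nbsp;0&nbsp;&nbsp;0.000000&nbsp;&nbsp;0.127623<BR>
0.60&nbsp;&nbsp;&nbsp;&nbsp;0&nbsp;&nbsp;0.000000&nbsp;&nbsp;0.127623<BR>
0.70&nbsp;&nbsp;&nbsp;&nbsp;2&nbsp;&nbsp;0.727572&nbsp;&nbsp;0.192375&nbsp;&nbsp;##<BR>
0.80&nbsp;&nbsp;&nbsp;&nbsp;3&nbsp;&nbsp;0.837395&nbsp;&nbsp;0.289786&nbsp;&nbsp;###<BR>
0.90&nbsp;&nbsp;&nbsp;&nbsp;4&nbsp;&nbsp;0.972075&nbsp;&nbsp;0.449355&nbsp;&nbsp;####</FONT><BR></FONT></DIV>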
<DIV><FONT face=Arial size=2>As you can see, my 0.2 min_dev is keeping all of 
the tokens between 0.3 and 0.7 from contributing to the final score.  
However, those tokens in the 0.1-0.3 range are not very hammy (e.g., professional, 
office, software, $60), while the ones between 0.5 and 0.7 are actually 
quite spammy (e.g., Adobe, Photoshop).  This email would probably score 
appropriately if the min_dev range were centered between my cutoffs, near 
0.3.</FONT><FONT face=Arial size=2></DIV>
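<DIV>To illustrate what a movable center would do, here is a small sketch of the 
exclusion test with a hypothetical "center" parameter.  As far as I know 
bogofilter 0.17.5 has no such option and min_dev is always measured from 0.5, so 
the parameter and the function name are made up for illustration.  The scores are 
the fw values from the "bogofilter -vvv" listing above.</DIV>
<PRE>
```python
# fw values copied from the "bogofilter -vvv" listing
FW = {
    "Windows": 0.170552,
    "Graphics": 0.178415,
    "Microsoft": 0.184265,
    "expensive": 0.193231,
    "need": 0.207088,
    "head:tanderso": 0.216744,
    "more": 0.220088,
    "Office": 0.263739,
}


def contributing(scores, min_dev=0.2, center=0.5):
    """Keep tokens whose score is at least min_dev away from center.

    'center' is hypothetical; with the default of 0.5 this mirrors the
    exclusion interval as I understand it to work today.
    """
    return {tok: fw for tok, fw in scores.items()
            if abs(fw - center) >= min_dev}


today = contributing(FW)                 # exclusion interval 0.3 to 0.7
shifted = contributing(FW, center=0.3)   # exclusion interval 0.1 to 0.5
print(len(today), "of", len(FW), "tokens count with center 0.5")
print(len(shifted), "of", len(FW), "tokens count with center 0.3")
```
</PRE>
<DIV>With the center at 0.5 all eight of these tokens count toward the score.  
With the center at 0.3 none of them do, since each lies within 0.2 of 
0.3.</DIV>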
<DIV><BR>> Both steps 1 and 4 can be considered as having "range centers"
and<BR>> "range widths". This similarity does not mean that these steps
have<BR>> comparable centers or comparable widths. I've run bogotune
with a<BR>> variety of test corpora and looked at its recommendations.
It generally<BR>> recommends a spam cutoff slightly above 0.5 (often a value 
like<BR>> 0.500010) and a ham cutoff much below 0.5 (at least 0.125).
This lack<BR>> of symmetry for final score is a further indication of apples
and<BR>> oranges, i.e. scoring tokens with a symmetric and centered
exclusion<BR>> interval produces inverse chi-square results with a
differently sized<BR>> and centered exclusion interval.<BR></DIV>
<DIV>Actually, the fact that you've fixed 0.5 for tokens and not for emails, and
that bogotune gives a non-0.5 center for emails indicates to me that maybe
the center for tokens should be freed up too. Think about it this
way... the reason your cutoffs are shifted to the ham side is because you're
finding that emails classified by bogofilter tend to err on the hammy side of
0.5. Ideally, spam emails would score near 1.0, hams near 0.0, and unsures
should center around 0.5. This would be a balanced bell curve. But
if you fixed your cutoffs around 0.5, you would get lots of false
negatives. In order to compensate for this fact, you shift your cutoffs
down. But you wouldn't have to shift your cutoffs down if bogofilter
scored emails clustered at the extrema with unsures at 0.5. You could just
as easily fix the cutoffs around 0.5 and free up the min_dev center to achieve
the same effect as we get today. Tokens would shift toward the spammy side
to compensate for their innate bias toward ham, and emails would overall score
with 0.5 being unsure. But this would produce the same problem as today as
well... unnecessary false negatives. The solution still seems to be to
free up both of these "range centers" to balance out inequities caused by other
parameters.</DIV>
<DIV> </DIV>
<DIV>> As I've indicated, I'm willing to add (on an experimental basis)
a<BR>> parameter for specifying the center of the exclusion interval.
At the<BR>> moment I've got no clue how much of a difference doing that
will<BR>> make. So far, however, NOBODY has responded to those
suggestions.<BR></DIV>
<DIV>If you could add a parameter to the config for changing the min_dev center, 
but default it to 0.5, then people could experiment with it without affecting 
any existing configurations.</DIV>
<DIV> </DIV>
<DIV>> Possibly, the center of the exclusion region should be the robx
value.<BR>> It isn't clear. Bogotune could be modified to vary the
center of the<BR>> exclusion interval. It might be interesting to see
what it finds to be<BR>> the best value. All it takes is
time.<BR></DIV>
<DIV>I also thought that robx would be a good value at first. However, the
entire point of robx is to bias unknown or little known tokens. If you use
this value to establish the center-point for all tokens, then it loses its
quality of biasing. Robx is not an appropriate value, but the mid point
should probably be somewhere near robx, or rather robx should be thought of as
an offset from the mid point rather than from 0.5.</DIV>
<DIV> </DIV>
<DIV>Tom</DIV>
<DIV> </DIV></FONT></BODY></HTML>