Self-adjusting bogofilter.cf settings

Tom Allison tallison at tacocat.net
Sat Feb 28 05:26:10 CET 2004


Mark Constable wrote:
> I did not mean that any tweaking should be on a daily basis, only
> the logging of certain state values like wordlist counts etc.
> 
> As an ISP admin, and an analogy, I track traffic and hard drive usage
> and produce pretty graphs which after 1/2 a year or so indicate trends
> to the point where it's possible to predict when bandwidth and HDDs
> have to be upgraded.
> 
> If various bogo* tools logged certain important settings, wordlist
> counts, whatever, on a daily basis then I could _imagine_ that if
> some parameters reached certain limits vs other parameter settings
> then it could be probable that x, y or z parameter could be tweaked,
> plus or minus some value according to yet another parameter, or sum
> of them over time, and to write out a new bogofilter.cf which then
> gets tracked and perhaps self-adjusted yet again down the track. What
> these "rules of adjustment" could be is losely tied up in the group
> understanding of some/most of the people on this list. If these rules
> could be extracted to code then anyone, even me, could start to apply 
> and fine tune bogofilter usage even better than "good enough".
> 
> Sorry about the long rave. If I knew what I was talking about I would 
> provide precise details, examples and even code.
> 

I thought that bogotune already gives you a lot of this information.  
The caveat is that you need a lot of spam/ham to work with in order for 
anything reasonable to be applied.

That said, I have found the Histogram to be fairly informative, along 
with the '-vv' command line option (e.g. running 'bogofilter -vv' on a 
saved message).  These two will give you a good indication of how your 
mail scored.

I would expect that the emails you most want to examine are the typical 
Unsure ones.  I have a few that come in that go straight to spam, but I 
have a very skewed ham token list and can't really blame them.  So I 
generally ignore them for the time being, since their misclassification 
comes from my poor sample set.  The Unsure ones are more interesting 
because they are flirting with the edge of what the cutoffs should/could 
be.

The following is an extremely "lay person" interpretation of what the 
parameters and filters do.  I didn't pay enough attention in statistics 
and probability to really earn many points here, but this is my guess.

Speaking in grotesque terms, and potentially with some degree of mild 
error: the chi-square combining tends to produce a distribution of 
scores shaped like a "U", with high probabilities at the extremes.  The 
ham_cutoff and spam_cutoff are arbitrary lines of demarcation along this 
"U" which say, "These scores are HAM, these scores are SPAM, and I 
don't know about the stuff in between".  But the in-between is this very 
thin profile at the bottom of the "U".
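
To make that concrete, here is a minimal sketch in Python of a 
three-way verdict (not bogofilter's actual code, and the cutoff values 
are just assumptions for illustration):

    def classify(score, ham_cutoff=0.45, spam_cutoff=0.99):
        # Three-way verdict from the final message score; the two
        # cutoffs carve the "U" into Ham / Unsure / Spam regions.
        if score >= spam_cutoff:
            return "Spam"
        if score <= ham_cutoff:
            return "Ham"
        return "Unsure"

    for s in (0.02, 0.60, 0.998):
        print(s, classify(s))    # -> Ham, Unsure, Spam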

The min_dev works on specific, individual tokens and not on the total 
score (as the cutoffs do).  It simply states that every token or word 
whose value falls within 0.5 +/- min_dev should be ignored for scoring 
purposes.  This is based on the historical evidence that spam is made up 
of neutral words ("the" and "meanwhile") and a few really strong words 
(viagra, mortgage, nigeria).

As for robx/robs:
robx is the starting point.  If you've never seen the word "aardvark", 
then robx assigns it an arbitrary starting value.  If you have a fairly 
limited dictionary, you can probably push robx towards the spammy side 
of things.  But if you have a varied dictionary in use, you might want 
to push new words towards the hammy side just a bit.  Generally 
speaking, I think most new words tend to be ham, since they've been 
selling viagra for a while now (as an example).
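
As far as I understand it, both parameters live in Gary Robinson's 
smoothing formula; a sketch (the numbers are placeholders, not 
necessarily bogofilter's shipped defaults):

    def f_w(n, p, robs=0.05, robx=0.52):
        # Robinson's smoothed token score:
        #   f(w) = (robs*robx + n*p(w)) / (robs + n)
        # n is the number of messages the token has appeared in and
        # p(w) is its raw spam estimate; with n == 0 this is exactly
        # robx.
        return (robs * robx + n * p) / (robs + n)

    print(f_w(0, 0.0))   # a never-seen token scores robx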

I'm not really sure about robs; this might be the hardest one to 
understand, but I'm taking a guess.  When you have a new word, it 
starts at a value of robx.  As more occurrences come in and you provide 
more feedback to bogofilter, the score of that token moves away from 
the initial robx value towards 0 or 1.  I interpret robs as a very fine 
tuning "drag" or friction on these words.  Another way of putting it is 
that robs prevents tokens from moving too far from robx too quickly.  
It implies a probationary limitation or correction on their score, 
pushing them back towards robx until their history (number of 
occurrences) increases sufficiently to be trusted.
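
That "drag" reading matches the formula above: the bigger robs is, the 
more weight robx keeps relative to the observed count.  A quick sketch 
(same assumed f(w) as before, with made-up robs values):

    def f_w(n, p, robs, robx=0.5):   # same f(w) as the robx sketch
        return (robs * robx + n * p) / (robs + n)

    # Watch a strongly spammy token (raw estimate p = 1.0) pull away
    # from robx as its message count n grows, for two robs settings.
    for robs in (0.01, 1.0):
        print(robs, [round(f_w(n, 1.0, robs), 3) for n in (0, 1, 5, 20)])
    # The larger robs keeps the score pinned near robx for longer.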

That's how I see these parameters.

min_dev creates a "blind spot" around 0.5 for ambiguous and/or common 
words that won't tell you definitively ham/spam.

*_cutoffs simply mark where you are willing to state ham/spam/unsure, 
and they help identify the more "fringe" emails by putting them into 
the unsure bucket.

robx is where you start with new words.

robs implies a quarantine or probationary correction on new words, to 
keep them from dominating more established words.  The score doesn't 
seem to be very sensitive to it: I've seen discussions with values 
ranging over orders of magnitude, so I would tune this dead last.
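
That insensitivity is easy to check against the assumed f(w) formula 
above: for a well-established token, changing robs by orders of 
magnitude barely moves the score.

    def f_w(n, p, robs, robx=0.5):   # same assumed f(w) as above
        return (robs * robx + n * p) / (robs + n)

    # A token with raw estimate p = 0.99 seen in 100 messages:
    for robs in (0.001, 0.01, 0.1, 1.0):
        print(robs, round(f_w(100, 0.99, robs), 5))
    # The result only changes around the third decimal place.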

Am I close?




