default parameters - new vs old vs mine

David Relson relson at osagesoftware.com
Wed Mar 31 03:55:41 CEST 2004


On 30 Mar 2004 20:27:07 -0500
Tom Anderson wrote:

> I wonder why they score so spammy for you.  They're not even near my
> unsure territory.

Here's the answer for one of the messages.  Offhand it looks like a
superdudes.net message has been registered as spam.

[relson at osage src]$ bogofilter -vvv -c bogo-new.cf -I parmcheck.0329.d/6115 -vvv
X-Bogosity: Spam, tests=bogofilter, spamicity=0.996868, version=0.17.4.cvs
                                     n    pgood     pbad      fw     invfwlog    fwlog  U
"rcvd:apache"                      137  0.001750  0.000100  0.053893 -0.05540  -2.92075 +
"rcvd:envelope-from"              5747  0.010617  0.082154  0.885552 -2.16763  -0.12154 +
"rcvd:eric"                      11885  0.012073  0.182176  0.937846 -2.77815  -0.06417 +
"head:eric"                      13631  0.012541  0.210561  0.943789 -2.87865  -0.05785 +
"to:eric"                        11927  0.009456  0.186124  0.951653 -3.02936  -0.04955 +
"head:www8.superdudes.net"           1  0.000000  0.000017  0.991605 -4.78017  -0.00843 +
"janthegreat"                        1  0.000000  0.000017  0.991605 -4.78017  -0.00843 +
"rcvd:www8.superdudes.net"           1  0.000000  0.000017  0.991605 -4.78017  -0.00843 +
"rtrn:www8.superdudes.net"           1  0.000000  0.000017  0.991605 -4.78017  -0.00843 +
N_P_Q_S_s_x_md                       9  2.00e-06  9.94e-01  9.97e-01 1.78e-02  5.20e-01 0.375
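
For anyone wondering how a token seen only once ends up with fw so
close to 1, here is a rough sketch of Robinson's smoothing formula,
f(w) = (s*x + n*p) / (s + n), using the s (robs), x (robx) and md
(min_dev) values from the last row of the output.  Taking p as
pbad / (pgood + pbad) is an assumption about how the columns relate,
the function name is made up for illustration, and the min_dev test is
only an approximation of what bogofilter does internally.

```python
def robinson_fw(n, p, robs, robx):
    """Smoothed token score: f(w) = (s*x + n*p) / (s + n)."""
    return (robs * robx + n * p) / (robs + n)

# Values read from the last row of the output (columns s, x, md).
ROBS, ROBX, MIN_DEV = 0.0178, 0.52, 0.375

# "janthegreat": n = 1, pgood = 0, so p = pbad / (pgood + pbad) = 1.0
# (assumed relationship between the pgood/pbad columns and p).
fw = robinson_fw(n=1, p=1.0, robs=ROBS, robx=ROBX)
print(f"fw = {fw:.6f}")

# A token only contributes if its score is far enough from 0.5.
print("used in scoring:", abs(fw - 0.5) >= MIN_DEV)
```

With robs this small, a single spam registration is enough to push a
previously unseen token to about 0.9916, which is the value shown for
the four superdudes.net-related tokens.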

Looking further at the 3 messages, it appears that they were
misclassified at some past time and that the wordlists were never
corrected.  The correction has been made and the test script will
be run yet again :-)

> > Remember that the above scores are "after the fact", i.e. messages
> > have been entered in the wordlists and are now being scored.  The
> > scores the messages get today are different from the scores they got
> > when they arrived because the wordlist is different.
> 
> True.  Still.
> 
> I don't keep many emails around, but I think maybe I'll scrape
> together whatever is saved in my client to test some numbers
> against... do you have a simple procedure for running bogofilter on a
> bunch of emails and collecting the results?  A script perhaps?

Attached is the script used.  It reflects my directory structure and my
MH formatted mail archive.  Without a doubt you'll need to change it to
use it.
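
For readers without access to the attachment, the general idea can be
sketched in Python.  This is not the attached script, only a minimal
illustration: it assumes one message per file in a directory, uses the
same -v, -c and -I options as the command shown earlier, and the
function names are invented for the example.

```python
import re
import subprocess
from pathlib import Path

# Matches e.g. "X-Bogosity: Spam, tests=bogofilter, spamicity=0.996868, ..."
BOGOSITY_RE = re.compile(r"X-Bogosity:\s*(\w+),.*?spamicity=([0-9.]+)", re.S)

def parse_bogosity(output):
    """Return (classification, spamicity) from bogofilter -v output, or None."""
    match = BOGOSITY_RE.search(output)
    if match is None:
        return None
    return match.group(1), float(match.group(2))

def score_file(path, config=None):
    """Run bogofilter on one message file and parse its X-Bogosity line."""
    cmd = ["bogofilter", "-v"]
    if config is not None:
        cmd += ["-c", str(config)]
    cmd += ["-I", str(path)]
    # bogofilter's exit status encodes the classification, so a non-zero
    # status is not treated as a failure here.
    result = subprocess.run(cmd, capture_output=True, text=True)
    return parse_bogosity(result.stdout)

def score_directory(directory, config=None):
    """Score every regular file in a directory; returns {filename: result}."""
    return {
        path.name: score_file(path, config)
        for path in sorted(Path(directory).iterdir())
        if path.is_file()
    }
```

Something like score_directory("parmcheck.0329.d", "bogo-new.cf") would
then give one (classification, spamicity) pair per message, which can be
compared across config files.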

> > > Just off-hand, I would suggest decreasing robx and increasing robs
> > > to better bias it.  But that's just based on my experience.
> > 
> > You're free to say that, however I've seen bogotune results that
> > contradict that idea.
> 
> Again with the bogotune... considering how just about everyone
> involved in bogofilter has expressed how they aren't entirely certain
> exactly how the various algorithms actually work together and why
> changing certain values one way or the other has the effect it does,
> an awful lot of faith is put into bogotune to magically come up with
> the best numbers. Admittedly, I haven't cracked open the source on it
> to audit the procedure it uses, but to me this constant reliance on it
> despite contradictory claims seems slightly out of place.  Maybe, just
> maybe, bogotune doesn't produce the best possible numbers.  Maybe it
> finds a local maximum instead of a global one.  Just my opinion, I
> could be wrong.

Bogotune does a two-pass grid search over a variety of robx, robs, and
min_dev values.  At the end of each pass it looks at the 10 best results
and picks the best one that avoids certain "outlying" criteria.  The
first pass is a coarse scan and the second pass is a finer scan
centered on the result of the first pass.  Bogotune finds a good set
of parameters and has not been advertised as finding the best possible
parameters.  There have always been disclaimers in what we say about its
results because no one fully understands the interactions between robs,
robx, and min_dev.  We can observe the behavior but we lack a
mathematical model to fully explain it.
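
To make the shape of that search concrete, here is a small Python
sketch of a coarse-then-fine grid search.  It is not bogotune's code:
the step count, the width of the second pass, the evaluate() callback
(lower is better) and the is_outlier() predicate are all placeholders
chosen for illustration.

```python
import itertools

def linspace(lo, hi, steps):
    """Evenly spaced values from lo to hi inclusive (steps >= 2)."""
    step = (hi - lo) / (steps - 1)
    return [lo + i * step for i in range(steps)]

def grid_pass(evaluate, ranges, steps, is_outlier, keep=10):
    """Score every grid point; return the best non-outlier of the top `keep`."""
    axes = [linspace(lo, hi, steps) for lo, hi in ranges]
    scored = sorted((evaluate(*pt), pt) for pt in itertools.product(*axes))
    for err, pt in scored[:keep]:
        if not is_outlier(pt):
            return err, pt
    return scored[0]  # fall back to the raw best if all are outliers

def two_pass_search(evaluate, ranges, steps=5, is_outlier=lambda pt: False):
    """Coarse scan over `ranges`, then a finer scan around the coarse winner."""
    _, coarse = grid_pass(evaluate, ranges, steps, is_outlier)
    fine_ranges = []
    for (lo, hi), center in zip(ranges, coarse):
        half = (hi - lo) / (steps - 1)  # one coarse cell on each side
        fine_ranges.append((max(lo, center - half), min(hi, center + half)))
    return grid_pass(evaluate, fine_ranges, steps, is_outlier)
```

Here ranges would hold (low, high) pairs for robx, robs and min_dev, and
evaluate(robx, robs, min_dev) would return an error count from scoring a
test corpus.  Because the second pass only looks near the first-pass
winner, a search like this returns a good point on the grid rather than
a guaranteed global optimum.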

As to reliance on bogotune, we use it because it's the best tool we
have.  When someone finds a better way of finding parameters, we'll
adopt it.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: parmcheck.0328.sh
Type: application/x-sh
Size: 1697 bytes
Desc: not available
URL: <http://www.bogofilter.org/pipermail/bogofilter/attachments/20040330/2c057613/attachment.sh>

