how to bogotune?

Thu Sep 30 05:30:07 CEST 2004

On Wed, 29 Sep 2004 23:30:19 -0300
Trevor Smith wrote:

> On September 29, 2004 9:26 pm, David Relson wrote:
> 
> > Let's work on this together, OK?  I'll explain and you let me know
> > when it becomes clear.  Then we can work on fixing the man page
> > so it's intelligible.  OK?
> 
> :-) Thanks. (Honestly, I sometimes do not seem like an imbecile.)
> 
> > For tuning, bogotune needs a wordlist representing a decent
> > amount of training history and it needs some additional (untrained)
> > messages to run the tuning tests on.  Experience has shown that the
> > wordlist needs the contents of 500 each spam and non-spam messages
> > (or more) and that there also need to be 2000 each of spam and
> > non-spam messages used for the tuning process.  Thus, in total, 5000
> > messages is the minimum
> 
> (picking up on the typo correction from your *next* email...)
> 
> That's the major clarification I needed (the 2000+2000 is for a *new* 
> wordlist.db and the 500+500 should be untrained).
> 
> One more clarification (it may have been covered somewhere, but I'm
> forgetting if I've ever seen a definitive answer): can I just use the
> wordlist.db I already have? Or is it preferable to pick 2000+2000
> spam/ham that have not yet been trained on to use as the new/temporary
> wordlist? Since I already have a wordlist with 10,000 or 20,000 messages
> in it, it seems unnecessary to build a new, smaller wordlist.db.
> Unless it's an issue of having a correct ratio of ham/spam in the
> wordlist.db, which I couldn't begin to guess at...

Honestly, you can do it either way.  Bogotune works to find the
parameters that give the minimum false negatives, i.e. "spam getting
through".  It hunts for the combination of robs, robx, min_dev, ns_esf,
and sp_esf parameters that make this possible.  Its first (coarse) scan
tests 3675 combinations and the second (fine) scan checks a varying
number of combinations.  What it finds is a local minimum -- a set of
values that give fewer false negatives than any of the surrounding sets.
Using a different set of messages for tuning will find a slightly
different combination.  Either set of parameters will do a good job for
you.
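
To make the coarse-then-fine idea concrete, here is a rough Python sketch.
It is not bogotune's code: it uses only two made-up parameters in the
range 0..1 instead of the five real ones, the grid sizes are arbitrary,
and false_negatives() is a hypothetical bumpy stand-in for "score the
test messages and count the spam that gets through".

```python
import itertools
import math


def false_negatives(params):
    """Hypothetical objective: a bumpy surface with several low spots.

    In real tuning this would be the number of missed spam messages for
    a given parameter set; here it is just a toy formula.
    """
    x, y = params
    bowl = (x - 0.3) ** 2 + (y - 0.6) ** 2
    bumps = 0.05 * math.sin(25 * x) * math.cos(25 * y)
    return bowl + bumps


def grid(center, half_width, steps):
    """Evenly spaced values around center, clipped to the range 0..1."""
    lo = max(0.0, center - half_width)
    hi = min(1.0, center + half_width)
    return [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]


def coarse_then_fine(score, dims=2, coarse_steps=5, fine_steps=11):
    """Scan a coarse grid, then a finer grid around the coarse winner."""
    # Coarse scan: try every combination on a wide, sparse grid.
    coarse_axes = [grid(0.5, 0.5, coarse_steps) for _ in range(dims)]
    coarse_best = min(itertools.product(*coarse_axes), key=score)

    # Fine scan: a denser grid covering one coarse cell on each side of
    # the coarse winner.  The coarse winner stays in the candidate list,
    # so the fine result is never worse than the coarse one.
    cell = 1.0 / (coarse_steps - 1)
    fine_axes = [grid(c, cell, fine_steps) for c in coarse_best]
    candidates = itertools.chain([coarse_best], itertools.product(*fine_axes))
    fine_best = min(candidates, key=score)
    return coarse_best, fine_best


if __name__ == "__main__":
    coarse, fine = coarse_then_fine(false_negatives)
    print("coarse best:", coarse, "score:", false_negatives(coarse))
    print("fine best:  ", fine, "score:", false_negatives(fine))
```

The fine scan only looks near the coarse winner, so what comes out is a
local minimum.  A different objective (that is, a different set of test
messages) can put the coarse winner in a different cell and lead to a
different, but similarly good, answer.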

As an analogy, the process is kind of like looking for a low spot in a
gravel driveway.  There are a lot of low spots (local minima) and
they're pretty similar to one another.  Starting your searches in different
places (using different sets of messages for tuning) will give different
results.  Since they're comparable to one another, it doesn't much
matter.

Hopefully the analogy has helped :-)

> I am still (until told definitively otherwise) assuming that it is, in
> fact, either the above (unbalanced spam/ham counts fed through
> bogofilter into the wordlist) or else the fact that the new spam/ham I
> fed into bogotune were already trained on that caused bogotune to give
> me my 0.000 and 0.000 recommended thresholds, since my wordlist.db has
> tons of emails fed through it over the months, and since I fed ~1000
> each of spam/ham into bogotune.

Having 0.000 thresholds shouldn't have happened. I'll have to look at
the earlier message.  However, that'll have to wait for tomorrow
evening.

Regards,

David