GETTING.STARTED (rev 2)

Wed Oct 27 13:07:11 CEST 2004

On Wed, 27 Oct 2004 12:07:03 +0200 (CEST)
Boris 'pi' Piwinger wrote:

...[snip]...

> > Eh??  Bogotune uses the wordlist, and the ham and spam corpora you
> > specify, and then does a rather exhaustive scan of possible scoring
> > parameters to find what gives the best results.  As you know,
> > bogotune has minimum requirements for number of messages registered
> > in wordlist.db and minimum numbers of messages for the ham and spam
> > corpora used in the tuning process.
> 
> Right, so it is not usable for pure train-on-error approaches.

Usually when I run bogotune, I start with an empty wordlist and 10K-15K
ham and 10K-15K spam.  I then use 20%-30% of the messages to populate
the wordlist and run bogotune on the remaining 70%-80% of the messages.
Given this initial training before tuning, how the wordlist is
_normally_ maintained doesn't enter into the picture.
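
The split described above can be sketched roughly as follows.  This is
only an illustration, assuming each corpus is available as a list of
messages (or message file names); the function name, the 25% default and
the fixed seed are made up for the example and are not part of
bogofilter or bogotune.

```python
import random


def split_for_tuning(messages, build_fraction=0.25, seed=0):
    """Split one corpus (ham or spam) into a build part and a tune part.

    The build part would be registered in an empty wordlist; the tune
    part is held back and only handed to the tuning run.
    """
    if not 0.0 < build_fraction < 1.0:
        raise ValueError("build_fraction must be strictly between 0 and 1")

    # Copy and shuffle so the split does not depend on arrival order.
    shuffled = list(messages)
    random.Random(seed).shuffle(shuffled)

    # The first 20%-30% populate the wordlist, the rest is for tuning.
    cut = int(len(shuffled) * build_fraction)
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    # Hypothetical corpus of 12,000 message names, for illustration only.
    ham = ["ham-%05d" % i for i in range(12000)]
    build, tune = split_for_tuning(ham)
    print(len(build), "messages to build the wordlist")
    print(len(tune), "messages kept unseen for tuning")
```

The same call would be made once for the ham corpus and once for the
spam corpus, so that no message used to build the wordlist shows up
again among the messages used for tuning.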

This build/tune method also simulates real usage (in a sense).  In
actual usage, the wordlist contains a past history of ham and spam
received.  The wordlist is then applied to incoming messages that are
new and have never been seen.  I think of this as "predictive", i.e. using
past experience to predict scoring of unseen (future) messages.

Using the working wordlist and messages that have already been processed
by bogofilter (and possibly entered in the wordlist) loses this
predictive aspect of real life.

> > The only effect that training method has is whether or not enough
> > messages are present in the wordlist for bogotune to work
> > successfully.
> 
> There is still this unsolved question if tuning works for
> train-on-error (or it becomes somehow circular). My guess is that it
> works, but is practically impossible.

Very possibly :-<

David