how to bogotune?

David Relson relson at osagesoftware.com
Thu Sep 30 02:40:09 CEST 2004


On Wed, 29 Sep 2004 20:26:03 -0400
David Relson wrote:

> Trevor,
> 
> Let's work on this together, OK?  I'll explain and you let me know
> when it becomes clear.  Then we can work on fixing the man page so
> it's intelligible.  OK?
> 
> ------
> 
> Bogofilter uses a BerkeleyDB database for storing token info from the
> messages with which it has been trained.  This file is named
> "wordlist.db" and is often called "the wordlist".
> 
> For tuning, bogotune needs a wordlist with a decent amount
> of training history and it needs some additional (untrained) messages
> to run the tuning tests on.  Experience has shown that the wordlist
> needs the contents of 500 each spam and non-spam messages (or more)
> and that there also need to be 2000 each of spam and non-spam messages
> used for the tuning process.  Thus, in total, 5000 messages is the
> minimum needed.
> 
> Given the 5000 messages, use 500 each of the ham and spam and build a
> new wordlist and use the other 2000 of each for the tuning part. 
> Commands:
> 
>    mkdir new_dir
>    bogofilter -v -d new_dir -s < mbox.with.500.spam
>    bogofilter -v -d new_dir -n < mbox.with.500.ham
>    bogotune -vv -d new_dir -s mbox.with.2000.spam -n mbox.with.2000.ham

Trevor,

A minor brain fart on my part caused me to switch the 500's and 2000's
in the above message.  The wordlist should have 2000 each and the tuning
set requires 500 of each.  Note that these are minimum numbers.  Larger
counts (particularly for the tuning set) will do better, so when you have
more than 2500 each, use 2000 for training and the rest for tuning.
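
With the counts corrected, the procedure can be sketched roughly as
below.  This is only an illustration under stated assumptions: the
helper names split_counts and train_and_tune are hypothetical (they are
not part of bogofilter), and the mbox files are assumed to have been
split by hand beforehand.  The bogofilter and bogotune options are the
ones used in the quoted message.

```shell
#!/bin/sh
# Minimum per-class counts stated in this message.
TRAIN_MIN=2000   # messages of each class used to build the wordlist
TUNE_MIN=500     # messages of each class held back for bogotune

# split_counts TOTAL (hypothetical helper)
# Prints "<train> <tune>" for one class (spam or ham), or fails if
# TOTAL is below the 2500 minimum.
split_counts() {
    total=$1
    if [ "$total" -lt $((TRAIN_MIN + TUNE_MIN)) ]; then
        echo "need at least $((TRAIN_MIN + TUNE_MIN)) messages, have $total" >&2
        return 1
    fi
    # Train on 2000; everything beyond that goes to the tuning set.
    echo "$TRAIN_MIN $((total - TRAIN_MIN))"
}

# train_and_tune DIR TRAIN_SPAM TRAIN_HAM TUNE_SPAM TUNE_HAM
# (hypothetical helper; defined here but not called)
# Builds a fresh wordlist in DIR from the training mboxes, then runs
# bogotune against the untrained tuning mboxes.
train_and_tune() {
    mkdir -p "$1"
    bogofilter -v -d "$1" -s < "$2"
    bogofilter -v -d "$1" -n < "$3"
    bogotune -vv -d "$1" -s "$4" -n "$5"
}
```

For example, "split_counts 3200" prints "2000 1200", meaning 2000
messages of that class go into the wordlist and the remaining 1200 are
kept for tuning.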

David



-- 
David Relson                   Osage Software Systems, Inc.
relson at osagesoftware.com       Ann Arbor, MI 48103
www.osagesoftware.com          tel:  734.821.8800


