bogotune and "exhaustion"

David Relson relson at osagesoftware.com
Mon Mar 29 15:17:08 CEST 2004


On Mon, 29 Mar 2004 07:43:46 -0500
Tom Allison wrote:

> David Relson wrote:

...[snip]...

> Is there some way I can use bogotune without a wordlist?
> I would think this might be the most unbiased way of determining 
> parameter settings given that you know the tokens and the expected 
> outcome for each email, that you would use the emails to determine
> which parameters would provide the most accurate selection of
> parameters with which to build a wordlist upon.
> 
> I'm thinking of this in an entirely backwards manner.
> 
> But if I start with a really large sample of email that is accurately
> sorted into spam/ham piles, is it possible to then determine the most
> accurate parameter settings such that, after building my wordlist from
> scratch using these email piles, I will have an optimum scoring
> accuracy?
> 
> And then, I could either use my existing wordlist, or rebuild it from 
> scratch based on those findings.
> 
> Crazy?

Hi Tom,

Two details:

First, proper use of bogotune calls for two message sets - the messages
used for training and the messages used for tuning.  If the same
messages are used in both places, "too few high scoring ham" is the
likely result.  If different sets of messages are used, the problem is
much less likely to occur.  My usual procedure is to divide my message
set (dealing the messages like dealing cards), train with one part and
tune with the other.
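
For illustration only, the "dealing cards" split can be sketched in a few
lines of Python.  This is not part of bogofilter or bogotune; the function
name deal() and the sample message labels are made up for the example, and
it assumes the messages are already loaded into a list.

```python
def deal(messages, parts=2):
    """Deal messages round-robin into `parts` piles, like dealing cards."""
    piles = [[] for _ in range(parts)]
    for i, msg in enumerate(messages):
        # Message 0 goes to pile 0, message 1 to pile 1, and so on,
        # wrapping around so the piles stay close to equal in size.
        piles[i % parts].append(msg)
    return piles


if __name__ == "__main__":
    # Hypothetical stand-ins for real messages.
    ham = [f"ham-{i}" for i in range(7)]
    train, tune = deal(ham)
    print("train:", train)
    print("tune: ", tune)
```

Dealing alternately, rather than cutting the set in half, keeps both piles
spread across the whole collection, so neither pile ends up holding only the
older or only the newer mail.  No message lands in both piles.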

Second, bogotune can be run without a wordlist using the "-D" option. It
will read all the messages into memory, divvy them up and use one part
for training and the other for tuning.

David



