bogotune and "exhaustion"

Mon Mar 29 14:43:46 CEST 2004

David Relson wrote:
> On Mon, 29 Mar 2004 07:02:20 -0500
> Tom Allison wrote:
> 
> 
>>Ran into a cute catch-22.
>>
>>bogotune wants a sample that includes some high-scoring ham and
>>(maybe) low-scoring spam to get a good calculation of what to set the
>>parameters to.
>>
>>Running corrections to exhaustion tends to remove that high-scoring
>>ham, giving you a big fat remark and a shortened bogotune output.
>>
>>So, it seems that I can do one or the other but not both on my
>>archives. Or I'm doing something wrong.
> 
> 
> Hi Tom,
> 
> What you say sounds reasonable.  Bogotune needs the variation in ham
> scores so that it can adjust the number of false positives.  I can
> believe that train-to-exhaustion conflicts with this.  It's also
> possible that bogotune could be modified to not need the high-scoring
> ham.  This possibility will need some thought since a different way of
> selecting spam_cutoff will be needed.  I'll think about it, but can't
> guarantee a solution.
> 
> David
> 

Is there some way I can use bogotune without a wordlist?
I would think this might be the most unbiased way of determining
parameter settings. Since you know the tokens and the expected outcome
for each email, you could use the emails themselves to determine which
parameters classify them most accurately, and then build a wordlist
with those parameters.

I'm thinking of this in an entirely backwards manner.

But if I start with a really large sample of email that is accurately
sorted into spam/ham piles, is it possible to then determine the most
accurate parameter settings such that, after building my wordlist from
scratch using these email piles, I will have optimal scoring accuracy?

And then, I could either use my existing wordlist, or rebuild it from 
scratch based on those findings.
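
To make the idea concrete, here is a rough Python sketch of what I
mean. It is not how bogotune works internally. It is a toy under
stated assumptions: tokens are whitespace-separated words, the score
is a plain average of Robinson-smoothed token probabilities (standing
in for the real chi-square combination), and the helper names (tokens,
build_counts, score, tune) are made up for illustration. The one real
point it shows is that the piles get split: one part builds the
wordlist, the other part is scored to compare parameter settings.

```python
from collections import Counter
from itertools import product


def tokens(msg):
    """Toy tokenizer: unique lowercase whitespace-separated words."""
    return set(msg.lower().split())


def build_counts(msgs):
    """Count in how many messages each token appears."""
    counts = Counter()
    for m in msgs:
        counts.update(tokens(m))
    return counts


def score(msg, spam_counts, ham_counts, n_spam, n_ham, robs, robx, min_dev):
    """Toy spamicity: mean of smoothed token probabilities (an assumption,
    not bogofilter's actual combining function)."""
    probs = []
    for t in tokens(msg):
        n = spam_counts[t] + ham_counts[t]
        if n:
            s = spam_counts[t] / n_spam
            h = ham_counts[t] / n_ham
            p = s / (s + h)
        else:
            p = robx  # unseen token: fall back to the prior
        f = (robs * robx + n * p) / (robs + n)  # Robinson smoothing
        if abs(f - 0.5) >= min_dev:  # ignore tokens too close to neutral
            probs.append(f)
    return sum(probs) / len(probs) if probs else robx


def tune(train_spam, train_ham, test_spam, test_ham, grid, robx=0.5):
    """Build counts from the training piles, then pick the grid point with
    the fewest (false positives, false negatives) on the held-out piles."""
    spam_counts = build_counts(train_spam)
    ham_counts = build_counts(train_ham)
    best = None
    for robs, min_dev, cutoff in product(
        grid["robs"], grid["min_dev"], grid["spam_cutoff"]
    ):
        def s(m):
            return score(m, spam_counts, ham_counts,
                         len(train_spam), len(train_ham),
                         robs, robx, min_dev)

        fp = sum(s(m) >= cutoff for m in test_ham)   # ham called spam
        fn = sum(s(m) < cutoff for m in test_spam)   # spam called ham
        key = (fp, fn)  # false positives count first, then false negatives
        if best is None or key < best[0]:
            best = (key, {"robs": robs, "min_dev": min_dev,
                          "spam_cutoff": cutoff})
    return best


# Made-up miniature piles, only to show the call.
train_spam = ["buy cheap pills now", "cheap pills online", "win money now"]
train_ham = ["meeting agenda for monday", "lunch on monday",
             "patch for the build"]
test_spam = ["cheap money pills"]
test_ham = ["monday build meeting"]
grid = {"robs": [0.01, 0.1, 1.0],
        "min_dev": [0.0, 0.1, 0.3],
        "spam_cutoff": [0.5, 0.6, 0.9]}

errors, params = tune(train_spam, train_ham, test_spam, test_ham, grid)
print(errors, params)
```

With real archives the held-out piles would need to contain some
hard-to-classify ham, otherwise every setting looks equally good and
the search has nothing to choose between, which is the same catch-22
as above.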

Crazy?