New user and question

Wed Oct 27 01:31:38 CEST 2010

On Tue, 26 Oct 2010 15:17:18 -0400
Thomas Anderson <tanderson at orderamidchaos.com> wrote:

> On 10/25/2010 6:57 PM, RW wrote:
> > On Mon, 25 Oct 2010 16:03:32 -0400
> > Thomas Anderson<tanderson at orderamidchaos.com>  wrote:

> >> -- repeat until it classifies correctly.
> >
> > In my experience that's ineffective with default settings
> > because the influence of new hapaxes and low-count tokens virtually
> > guarantees correct identification on the second test.
> 
> I've had great success doing it this way.  These are my settings:
> 
> robx=0.69
> robs=0.33
> min_dev=0.2
> spam_cutoff=0.7
> ham_cutoff=0.3
> 
> The method may indeed be less necessary for small word lists.  But by 
> the time you've had a few tens of thousands of emails through 
> bogofilter, a single training often has little effect.  

You're using a value of robs that's ~20 times the default, and I'm
guessing you're also using spamitarium, which presumably reduces the
number of new tokens. With Bogofilter defaults it wouldn't work, because
repeated training would scarcely ever happen. The detail is important.
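
To make the robs point concrete, here is a minimal Python sketch, assuming
Robinson's smoothing f(w) = (robs*robx + n*p) / (robs + n), where n is the
number of messages containing the token and p is its raw spam probability.
The function name is hypothetical, and the "default" values (robs=0.0178,
robx=0.52) are an assumption from memory rather than something stated in
this thread, so check them against your own bogofilter.cf.

```python
def token_score(n, p, robs, robx):
    """Robinson-smoothed token score: (robs*robx + n*p) / (robs + n).

    n    -- number of messages in which the token has been seen
    p    -- raw spam probability of the token (0.0 = pure ham, 1.0 = pure spam)
    robs -- weight given to the prior
    robx -- prior score assumed for a never-seen token
    """
    return (robs * robx + n * p) / (robs + n)


# Assumed bogofilter defaults (not taken from this thread).
DEFAULTS = {"robs": 0.0178, "robx": 0.52}
# Settings quoted earlier in the thread.
QUOTED = {"robs": 0.33, "robx": 0.69}

if __name__ == "__main__":
    for label, cfg in (("defaults", DEFAULTS), ("quoted", QUOTED)):
        for n in (1, 2, 10):
            ham = token_score(n, 0.0, **cfg)   # token seen only in ham
            spam = token_score(n, 1.0, **cfg)  # token seen only in spam
            print(f"{label:8s} n={n:2d}  pure-ham={ham:.4f}  pure-spam={spam:.4f}")
```

Under these assumptions a token seen once, only in ham, scores roughly 0.01
with the assumed defaults but roughly 0.17 with the quoted settings, so a
single training pass moves a hapax much less far when robs is large.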

One thing that I do, which seemed like a good idea at the time, is to
use a higher value of robs for training-to-exhaustion (TTE) than is
used for classification. The idea is that TTE can optimize the core
tokens without detuning low-scoring pure tokens in the classification.