Training from scratch.

Thu Jul 15 16:48:14 CEST 2004

I would like to add to this conversation by asking about what happens
after the training.  How do I get bogofilter to start discarding spam
that falls into the unsure category?

I trained my database with a known set of spam and ham.  This worked
great; however, I still get about 3 or so spam a day.  These are the
spam with the funny text mixed in.  For example:

draft alderman quasar arccos consortium caught eleventh bergen ache
arboreal worshipful crochet digital brunswick van nasal

I have seen a reference to '-o 0.8,0.2' or something similar, but I am
not sure where this command-line option is supposed to go or how those
two numbers were chosen.  How do you decide what the values should be?
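
To make the question concrete, here is a minimal sketch of how two
cutoffs could split a score into three classes.  It assumes the two
numbers after -o are the spam cutoff and the ham cutoff, in that order;
the function name and the exact boundary comparisons are illustrative
guesses, not bogofilter's actual code.

```python
def classify(score, spam_cutoff=0.8, ham_cutoff=0.2):
    """Map a spamicity score in [0, 1] to 'spam', 'ham' or 'unsure'.

    Assumption: scores at or above spam_cutoff are spam, scores at or
    below ham_cutoff are ham, and everything in between is unsure.
    """
    if score >= spam_cutoff:
        return "spam"
    if score <= ham_cutoff:
        return "ham"
    return "unsure"


if __name__ == "__main__":
    # A message scoring 0.7 is unsure with the cutoffs 0.8,0.2 ...
    print(classify(0.7))
    # ... but counts as spam if the spam cutoff is lowered to 0.6.
    print(classify(0.7, spam_cutoff=0.6))
```

Under these assumptions, lowering the first number narrows the unsure
band from above, so more borderline spam would be classified as spam,
at the cost of a higher risk of false positives.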

Thank you to those who wrote this program, because it limits my incoming
spam to just a few instead of 15 to 20 a day per account.


On Thu, 2004-07-15 at 06:01, Tom Eastman wrote:
> On Friday 16 July 2004 01:21, Tom Anderson wrote:
> > I would set the min_dev to a relatively high value when your database is
> > small.  This way, more email is properly classified as unsure (since
> > bogofilter really is unsure at this point on most things) and not
> > misclassified.  Only after you see certain tokens multiple times will they
> > start to affect scoring.  Otherwise it would be quite possible to
> > misclassify emails due to common words only showing up in spam at first,
> > and then you get a false positive when they show up in a ham.  With a
> > higher min_dev, it should be a relatively smooth transition from mostly
> > unsures to mostly correct classifications, without ever having lots of
> > misclassifications.
> That's really surprising, I thought a high min_dev would have the opposite
> effect -- that scores would be more likely to be close to 0.0 or 1.0.
> I had intuited that a low min_dev would mean that there were more neutral-ish
> tokens that would push the score towards 0.5.
> Am I just confusing myself?
> 	Tom
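
One way to see both points of view is a small sketch of how a min_dev
threshold might interact with a smoothed token probability.  It assumes
Robinson-style smoothing, f = (s*x + n*p) / (s + n), and that a token
is ignored when f is closer to 0.5 than min_dev.  The values s=1.0,
x=0.5 and min_dev=0.3 are chosen only for illustration and are not
bogofilter's defaults; the function names are hypothetical.

```python
def smoothed(p, n, s=1.0, x=0.5):
    """Robinson-style smoothed probability for a token.

    p: raw spam probability of the token (1.0 = seen only in spam)
    n: number of times the token has been seen in training
    s: strength of the prior (illustrative value)
    x: assumed probability for a token never seen before
    """
    return (s * x + n * p) / (s + n)


def counts_toward_score(p, n, min_dev):
    """True if the token is far enough from 0.5 to be used in scoring."""
    return abs(smoothed(p, n) - 0.5) >= min_dev


if __name__ == "__main__":
    # Seen once, only in spam: smoothed value is 0.75, ignored at 0.3.
    print(smoothed(1.0, 1), counts_toward_score(1.0, 1, 0.3))
    # Seen five times, only in spam: smoothed value is about 0.92, used.
    print(smoothed(1.0, 5), counts_toward_score(1.0, 5, 0.3))
```

In this sketch, a high min_dev leaves only strongly spammy or hammy
tokens in play, which matches the intuition that scores move toward 0.0
or 1.0.  With a small database, though, few tokens have been seen often
enough to clear the threshold, so many messages would have little or
nothing to score on and would land in unsure.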
