Training from scratch.

Tom Allison tallison at tacocat.net
Sun Jul 18 04:33:53 CEST 2004


Tom Anderson wrote:

> I would set the min_dev to a relatively high value when your database is
> small.  This way, more email is properly classified as unsure (since
> bogofilter really is unsure at this point on most things) and not
> misclassified.  Only after you see certain tokens multiple times will they
> start to affect scoring.  Otherwise it would be quite possible to
> misclassify emails due to common words only showing up in spam at first, and
> then you get a false positive when they show up in a ham.  With a higher
> min_dev, it should be a relatively smooth transition from mostly unsures to
> mostly correct classifications, without ever having lots of
> misclassifications.
> 

Interesting thoughts.  I would have gone with a very small min_dev for 
the following reason:

When you first start out, most of your tokens will score near robx 
(typically ~0.50).  As such, they will not contribute much to the 
ham/spam evaluation if your min_dev is very high, at least until the 
count for a given token is high enough that the robx/robs "effect" 
is overcome.

That said, I think it would be more reasonable to use a high robs 
value with a smaller min_dev value to prevent low-count tokens from 
contributing to bogofilter's score.
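
To make the robs/min_dev trade-off concrete, here is a rough sketch of 
Robinson-style smoothing, f(w) = (robs*robx + n*p) / (robs + n).  The 
function names are made up for illustration, the spam and ham corpora 
are assumed to be the same size, and the exact cutoff comparison is an 
assumption; this is not bogofilter's actual code.

```python
def token_score(spam_count, ham_count, robx=0.5, robs=1.0):
    """Robinson-smoothed score f(w) = (robs*robx + n*p) / (robs + n).

    Assumption: spam and ham corpora are the same size, so p is simply
    the share of this token's sightings that were in spam.
    """
    n = spam_count + ham_count
    if n == 0:
        return robx  # never-seen token: the score is just the prior
    p = spam_count / n
    return (robs * robx + n * p) / (robs + n)


def is_counted(score, min_dev):
    """True if the score is far enough from 0.5 to take part in scoring.

    Assumption: the cutoff is inclusive (>=).
    """
    return abs(score - 0.5) >= min_dev
```

With robx=0.5 and min_dev=0.1, a token seen once in spam scores 0.75 
when robs=1 and is counted, but scores about 0.545 when robs=10 and is 
ignored.  After ten spam sightings the robs=10 score reaches 0.75 and 
the token starts to count.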

With early training I think bogofilter can use all the information at 
its disposal, hence min_dev should not be very large.  But I would be 
certain to keep robx within 0.5 +/- min_dev for quite some time.
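
The reason for keeping robx inside that band: a token that has never 
been seen scores exactly robx, so if robx sits outside 0.5 +/- min_dev, 
every unknown word pushes the message one way.  A small check, with a 
made-up function name and an assumed inclusive cutoff:

```python
def unseen_token_counts(robx, min_dev):
    """True if a never-seen token (score == robx) would affect scoring.

    Assumption: tokens count when |score - 0.5| >= min_dev.
    """
    return abs(robx - 0.5) >= min_dev
```

For example, robx=0.52 with min_dev=0.1 keeps unknown words out of the 
score, while robx=0.7 with the same min_dev lets every unknown word 
lean the message toward spam.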



