Training from scratch.

Tom Anderson tanderso at
Thu Jul 15 15:21:21 CEST 2004

From: "Tom Eastman" <tom-lists at>
> I've never really played with the various constants you can set for the
> calculations but I was wondering... do you think it might be appropriate to
> set the min_dev to a very low value when your database is still very small?
> That way more tokens will be taken into account, and you can set it higher
> once you have a better range of tokens in your database.

I would set the min_dev to a relatively high value when your database is
small.  This way, more email is properly classified as unsure (since
bogofilter really is unsure at this point on most things) and not
misclassified.  Only after you see certain tokens multiple times will they
start to affect scoring.  Otherwise it would be quite possible to
misclassify emails due to common words only showing up in spam at first, and
then you get a false positive when they show up in a ham.  With a higher
min_dev, it should be a relatively smooth transition from mostly unsures to
mostly correct classifications, without ever having lots of
misclassifications.
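
To make the effect of min_dev concrete, here is a rough Python sketch.  It
is not bogofilter's actual code.  It assumes Robinson-style smoothing, where
a token's raw spam probability p is pulled toward a neutral value x with
strength s until the token has been seen n times, and it assumes a token
counts only when its smoothed value is at least min_dev away from 0.5.  The
function names and the values s=1.0, x=0.5 and min_dev=0.375 are
illustrative assumptions, not recommended settings.

```python
def smoothed(p, n, s=1.0, x=0.5):
    """Robinson-style smoothing: f = (s*x + n*p) / (s + n).

    p: raw spam probability of the token, n: times the token was seen.
    s and x are illustrative values chosen for this sketch.
    """
    return (s * x + n * p) / (s + n)


def counts_toward_score(p, n, min_dev):
    """True if the smoothed value is at least min_dev away from neutral 0.5."""
    return abs(smoothed(p, n) - 0.5) >= min_dev


# A word seen only in spam (p = 1.0), once versus five times.
seen_once = counts_toward_score(1.0, 1, min_dev=0.375)
seen_five = counts_toward_score(1.0, 5, min_dev=0.375)
seen_once_low = counts_toward_score(1.0, 1, min_dev=0.1)
```

Under these assumptions the word seen once is ignored at the higher min_dev
and only starts to count after repeated sightings, while the lower min_dev
lets it influence the score right away.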

> My other question was with thresh_update.  I want to set this to a very
> value so that emails that score very closely to 0.00 or 1.00 are not added
> to the database, but it occurred to me that when the database is very small
> it's quite possible that a spam could in fact get a score of 0.000.  If I then
> correct the classification with my script it might try to unregister from
> nonspam something that wasn't actually registered as nonspam to begin with.

In the beginning, if you set a high min_dev, you should minimize the
possibility of such a strong misclassification.  Nonetheless, it shouldn't
cause any problems to unregister a spam that hasn't been registered yet,
unless perhaps your total email count were to go below one (if there is no
bounds checking).
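
As an illustration of the bounds checking mentioned above, here is a minimal
Python sketch.  It does not reflect how bogofilter stores its wordlists.
The dictionary layout and the names should_autoupdate and unregister are
hypothetical, and the reading of thresh_update (skip the update when the
score is within thresh_update of 0.0 or 1.0) is taken from the description
in the question.

```python
def should_autoupdate(score, thresh_update):
    """Register a message only if its score is not within
    thresh_update of 0.0 or 1.0 (assumed meaning of thresh_update)."""
    return thresh_update <= score <= 1.0 - thresh_update


def unregister(wordlist, tokens):
    """Remove one sighting of each token and one message from a wordlist,
    clamping every count at zero so nothing goes negative."""
    for tok in tokens:
        if tok in wordlist["tokens"]:
            wordlist["tokens"][tok] = max(0, wordlist["tokens"][tok] - 1)
    wordlist["msg_count"] = max(0, wordlist["msg_count"] - 1)


# Hypothetical nonspam wordlist that is still empty.
ham = {"msg_count": 0, "tokens": {}}

# A spam that scored 0.000 is skipped by the threshold, so it was never
# registered as nonspam.
was_registered = should_autoupdate(0.0, thresh_update=0.01)

# The later correction tries to unregister it from nonspam anyway.
unregister(ham, ["cheap", "pills"])
```

With the clamp in place the correction leaves the empty wordlist unchanged
instead of pushing the message count below zero.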

