Training from scratch.

Tom Anderson tanderso at oac-design.com
Thu Jul 15 09:21:21 EDT 2004


From: "Tom Eastman" <tom-lists at celleste.no-ip.org>
> I've never really played with the various constants you can set for the
> calculations but I was wondering... do you think it might be appropriate to
> set the min_dev to a very low value when your database is still very small?
> That way more tokens will be taken into account, and you can set it higher
> once you have a better range of tokens in your database.

I would set min_dev to a relatively high value while your database is
small.  That way, more email is properly classified as unsure (since
bogofilter really is unsure about most things at this point) rather than
misclassified.  Tokens will only start to affect scoring after you have seen
them multiple times.  Otherwise it would be quite possible to misclassify
emails because common words happened to show up only in spam at first, and
you would then get a false positive when they show up in a ham.  With a
higher min_dev, the transition from mostly unsures to mostly correct
classifications should be relatively smooth, without ever producing lots of
misclassifications.
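
To make the reasoning above concrete, here is a rough sketch in Python.  It
is not bogofilter's code.  It assumes a Robinson-style smoothed token score
and treats min_dev as "ignore any token whose score is within min_dev of
0.5".  The function names and the robs/robx values are made up for
illustration and are not meant to be bogofilter's defaults.

```python
def token_score(spam_count, ham_count, spam_msgs, ham_msgs, robs=1.0, robx=0.5):
    """Smoothed spamicity of one token (assumed Robinson-style formula).

    robs is the weight given to the prior guess robx; both values are
    illustrative assumptions.
    """
    n = spam_count + ham_count
    spam_rate = spam_count / spam_msgs if spam_msgs else 0.0
    ham_rate = ham_count / ham_msgs if ham_msgs else 0.0
    if n == 0 or spam_rate + ham_rate == 0.0:
        return robx  # no evidence yet: fall back to the prior
    p = spam_rate / (spam_rate + ham_rate)
    # Pull the raw estimate p toward robx; the pull weakens as n grows.
    return (robs * robx + n * p) / (robs + n)


def counts_toward_score(score, min_dev):
    """A token is used only if it deviates from 0.5 by at least min_dev."""
    return abs(score - 0.5) >= min_dev


if __name__ == "__main__":
    # A word seen only in spam so far, in a database of 10 spam and 10 ham.
    for sightings in (1, 2, 3):
        score = token_score(sightings, 0, 10, 10)
        print(sightings, round(score, 3),
              "low min_dev:", counts_toward_score(score, 0.10),
              "high min_dev:", counts_toward_score(score, 0.35))
```

Under these assumptions a word seen once, and only in spam, already counts
with the low min_dev but is ignored with the high one until it has been seen
a few more times.  How many sightings that takes depends on how strong the
smoothing (robs) is, so treat the numbers as an illustration only.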

> My other question was with thresh_update.  I want to set this to a very low
> value so that emails that score very closely to 0.00 or 1.00 are not added to
> the database, but it occurred to me that when the database is very small it's
> quite possible that a spam could in fact get a score of 0.000.  If I then
> correct the classification with my script it might try to unregister from
> nonspam something that wasn't actually registered as nonspam to begin with!

Setting a high min_dev in the beginning should minimize the possibility of
such a strong misclassification.  Nonetheless, it shouldn't cause any
problems to unregister a spam that hasn't been registered yet, unless
perhaps your total email count were to go below one (if there is no bounds
checking).
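
For what I mean by bounds checking, here is a minimal sketch in Python.
TokenCounts is a hypothetical in-memory stand-in for one word list (say,
nonspam); I am not claiming this is how bogofilter stores or updates its
counts.

```python
from collections import Counter


class TokenCounts:
    """Hypothetical word list: a message count plus per-token counts."""

    def __init__(self):
        self.msg_count = 0
        self.tokens = Counter()

    def register(self, tokens):
        self.msg_count += 1
        for tok in set(tokens):
            self.tokens[tok] += 1

    def unregister(self, tokens):
        # Bounds checking: never let any count drop below zero, even if
        # the message was never registered in this list to begin with.
        self.msg_count = max(0, self.msg_count - 1)
        for tok in set(tokens):
            if self.tokens[tok] > 1:
                self.tokens[tok] -= 1
            else:
                self.tokens.pop(tok, None)


if __name__ == "__main__":
    nonspam = TokenCounts()
    nonspam.unregister(["cheap", "pills"])  # was never registered here
    print(nonspam.msg_count, dict(nonspam.tokens))
```

With the clamping in place, unregistering a message that was never
registered leaves the list empty instead of driving the counts negative.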

Tom


