Training from scratch.
Tom Allison
tallison at tacocat.net
Sun Jul 18 04:33:53 CEST 2004
Tom Anderson wrote:
> I would set the min_dev to a relatively high value when your database is
> small. This way, more email is properly classified as unsure (since
> bogofilter really is unsure at this point on most things) and not
> misclassified. Only after you see certain tokens multiple times will they
> start to affect scoring. Otherwise it would be quite possible to
> misclassify emails due to common words only showing up in spam at first, and
> then you get a false positive when they show up in a ham. With a higher
> min_dev, it should be a relatively smooth transition from mostly unsures to
> mostly correct classifications, without ever having lots of
> misclassifications.
>
Interesting thoughts. I would have gone with a very small min_dev for
the following reason:
When you first start out, most of your tokens will score near robx
(typically ~0.50). They will not contribute much to the ham/spam
evaluation if your min_dev is very high, at least not until the number
of incidents for a given token is high enough to overcome the robx/robs
"effect".
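
To make the robx/robs "effect" concrete, here is a minimal sketch of
Robinson-style smoothing as I understand it: f(w) = (robs*robx + n*p) /
(robs + n), where n is how often the token has been seen and p is its raw
spam ratio. The function names, the sample values, and the exact form of
the min_dev comparison are assumptions for illustration, not bogofilter's
actual code.

```python
def token_score(n, p, robx=0.5, robs=1.0):
    """Smoothed spam probability for a token seen n times with raw spam ratio p.

    With n == 0 the result is exactly robx; as n grows it approaches p.
    """
    return (robs * robx + n * p) / (robs + n)


def contributes(score, min_dev):
    """Assumed rule: a token counts only if it is at least min_dev away from 0.5."""
    return abs(score - 0.5) >= min_dev


if __name__ == "__main__":
    # A token seen only in spam (p = 1.0), at increasing counts.
    for n in (0, 1, 2, 5, 20):
        f = token_score(n, p=1.0)
        print(n, round(f, 3),
              "min_dev=0.1:", contributes(f, 0.1),
              "min_dev=0.4:", contributes(f, 0.4))
```

With these sample values a single sighting is enough to clear a min_dev
of 0.1, while a min_dev of 0.4 keeps the token out of the score until it
has been seen several times.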
That said, I think it would be more reasonable to put in a high robs
value with a smaller min_dev value to prevent low-incident tokens from
contributing to the score of bogofilter.
With early training I think bogofilter can use all the information at
its disposal, hence min_dev should not be very large. But I would be
certain to put robx within 0.5 +/- min_dev for quite some time.
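
As a rough illustration of the trade-off between robs, robx and min_dev,
the sketch below counts how many sightings a spam-only token needs before
it leaves the 0.5 +/- min_dev band. It assumes the same Robinson-style
smoothing formula, f = (robs*robx + n*p) / (robs + n); the helper name
sightings_needed and all numbers are hypothetical, chosen only to show
the direction of the effect.

```python
def sightings_needed(robs, min_dev, robx=0.5, p=1.0, limit=10_000):
    """Smallest count n at which the smoothed score leaves 0.5 +/- min_dev.

    Returns None if the token never leaves the band within `limit` sightings.
    """
    for n in range(limit + 1):
        f = (robs * robx + n * p) / (robs + n)
        if abs(f - 0.5) >= min_dev:
            return n
    return None


if __name__ == "__main__":
    print(sightings_needed(robs=1.0, min_dev=0.1))             # low robs
    print(sightings_needed(robs=10.0, min_dev=0.1))            # high robs
    print(sightings_needed(robs=1.0, min_dev=0.1, robx=0.7))   # robx outside band
```

A higher robs delays the point at which a rarely seen token starts to
count, even with a small min_dev. If robx sits outside the band, the
function returns 0, meaning tokens that have never been seen would
already influence the score, which is why robx should stay inside
0.5 +/- min_dev.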