New user and question
Thomas Anderson
tanderson at orderamidchaos.com
Tue Oct 26 21:17:18 CEST 2010
On 10/25/2010 6:57 PM, RW wrote:
> On Mon, 25 Oct 2010 16:03:32 -0400
> Thomas Anderson <tanderson at orderamidchaos.com> wrote:
>> I recommend training to exhaustion. That is, when a false positive,
>> false negative, or unsure shows up, first you train it, then you
>> check it again as if the same exact email arrived another time, and
>> if it still doesn't classify correctly, train it again -- repeat
>> until it classifies correctly.
>
> In my experience that's ineffective with default settings because the
> influence of new hapaxes and low-count tokens virtually guarantees
> correct identification on the second test - unless you use a very large
> value of "robs" that would be unsuitable for normal classification. It
> makes more difference if you do it iteratively on corpora.
I've had great success doing it this way. These are my settings:
robx=0.69
robs=0.33
min_dev=0.2
spam_cutoff=0.7
ham_cutoff=0.3
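For anyone following along, settings like these go in bogofilter's configuration file (commonly ~/.bogofilter.cf or /etc/bogofilter.cf, depending on the install). Briefly annotated — the comments are my own gloss, so check bogofilter(1) for the authoritative definitions:

```
# Default spam probability assigned to tokens never seen before
robx=0.69
# Weight given to robx relative to a token's observed counts
# (Robinson's s); larger values pull rare tokens toward robx
robs=0.33
# Tokens whose score is within min_dev of 0.5 are ignored
min_dev=0.2
# Scores >= spam_cutoff are spam; <= ham_cutoff are ham;
# anything in between is reported as unsure
spam_cutoff=0.7
ham_cutoff=0.3
```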
The method may indeed be less necessary for small word lists. But by
the time you've had a few tens of thousands of emails through
bogofilter, a single training often has little effect. Doing the
exhaustive training method ensures that your manual indication that a
given email is ham or spam is actually reflected in subsequent
classifications.
That said, I would use the method even if starting from a brand spanking
new word list since there is no cost in doing so and the beneficial
effect may be noticeable fairly quickly.
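As a sketch, the train-to-exhaustion loop is easy to script. This assumes bogofilter's documented exit codes (0 = spam, 1 = ham, 2 = unsure) and the -s flag for registering a message as spam; the function name and the safety cap are my own additions. A ham version would use -n and test for exit code 1 instead.

```shell
# train_until_spam MESSAGE_FILE [MAX_PASSES]
# Register a known-spam message repeatedly until bogofilter
# actually classifies it as spam, as described above.
train_until_spam() {
    msg=$1
    max=${2:-10}    # safety cap so a pathological message can't loop forever
    i=0
    while [ "$i" -lt "$max" ]; do
        if bogofilter < "$msg"; then    # exit 0 means "classified as spam"
            return 0
        fi
        bogofilter -s < "$msg"          # register as spam, then re-test
        i=$((i + 1))
    done
    return 1                            # still not classified as spam
}
```

The loop re-tests after every registration, so a message that is already scored correctly costs only a single classification pass.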
BTW, I would amend my initial response to advise the original poster,
Doug, to also set his spam_cutoff a little lower in order to capture
more spammy unsures as spam. But to prevent false positives, robx
should always be less than spam_cutoff.
Tom