New user and question

Thomas Anderson tanderson at orderamidchaos.com
Tue Oct 26 21:17:18 CEST 2010


On 10/25/2010 6:57 PM, RW wrote:
> On Mon, 25 Oct 2010 16:03:32 -0400
> Thomas Anderson <tanderson at orderamidchaos.com> wrote:
>> I recommend training to exhaustion.  That is, when a false positive,
>> false negative, or unsure shows up, first you train it, then you
>> check it again as if the same exact email arrived another time, and
>> if it still doesn't classify correctly, train it again -- repeat
>> until it classifies correctly.
>
> In my experience that's ineffective with default settings because the
> influence of new hapaxes and low-count tokens virtually guarantees
> correct identification on the second test - unless you use a very large
> value of "robs" that would be unsuitable for normal classification. It
> makes more difference if you do it iteratively on corpora.

I've had great success doing it this way.  These are my settings:

robx=0.69
robs=0.33
min_dev=0.2
spam_cutoff=0.7
ham_cutoff=0.3
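To spell out what those cutoffs do: bogofilter assigns each message a
spamicity score between 0 and 1, and the two cutoffs partition that
range into spam, ham, and unsure.  Roughly like this sketch (my own
illustration, not bogofilter's actual code; the "classify" helper name
is mine, with the 0.7/0.3 values hard-coded from the settings above):

```shell
# Sketch of how spam_cutoff/ham_cutoff partition the spamicity score.
# "classify" is a hypothetical helper, not part of bogofilter.
classify() {
    # awk handles the floating-point comparison
    awk -v s="$1" 'BEGIN {
        if (s >= 0.7)      print "spam"    # score >= spam_cutoff
        else if (s <= 0.3) print "ham"     # score <= ham_cutoff
        else               print "unsure"  # falls between the cutoffs
    }'
}
```

So with these settings a message scoring 0.5 lands in the fairly wide
unsure band, which is exactly what feeds the exhaustive training loop.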

The method may indeed be less necessary for small word lists.  But by 
the time you've had a few tens of thousands of emails through 
bogofilter, a single training often has little effect.  The exhaustive 
training method ensures that your manual indication that a given email 
is ham or spam is actually reflected in subsequent classifications.
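Concretely, the loop looks roughly like this (a sketch, assuming 
bogofilter is on your PATH; the "train_to_exhaustion" name and the 
MAXPASS safety limit are mine, not bogofilter's, and it relies on 
bogofilter's documented exit codes: 0 = spam, 1 = ham, 2 = unsure):

```shell
#!/bin/sh
# Train-to-exhaustion sketch.  "train_to_exhaustion" and MAXPASS are
# my own names; -s registers a message as spam, -n as ham, and a bare
# bogofilter invocation classifies (exit 0 = spam, 1 = ham, 2 = unsure).
train_to_exhaustion() {
    msg=$1       # file holding the full message, headers included
    class=$2     # "spam" or "ham"
    MAXPASS=10   # safety valve in case classification never converges

    if [ "$class" = spam ]; then
        want=0; flag=-s            # -s registers the message as spam
    else
        want=1; flag=-n            # -n registers the message as ham
    fi

    pass=0
    while [ "$pass" -lt "$MAXPASS" ]; do
        bogofilter "$flag" < "$msg" || true   # train on the message once
        rc=0
        bogofilter < "$msg" || rc=$?          # re-test as if it arrived anew
        if [ "$rc" -eq "$want" ]; then
            return 0                          # now classifies correctly
        fi
        pass=$((pass + 1))
    done
    return 1                                  # did not converge
}
```

The point of the re-test is that registering a message once does not 
guarantee the database now scores it on the correct side of the cutoff; 
the loop repeats until it does.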

That said, I would use the method even if starting from a brand spanking 
new word list since there is no cost in doing so and the beneficial 
effect may be noticeable fairly quickly.

BTW, I would amend my initial response to advise the original poster, 
Doug, to also set his spam_cutoff a little lower in order to capture 
more spammy unsures as spam.  But to prevent false positives, robx 
should always be less than spam_cutoff.

Tom

More information about the Bogofilter mailing list