New user and question

RW rwmaillists at googlemail.com
Tue Oct 26 00:57:19 CEST 2010


On Mon, 25 Oct 2010 16:03:32 -0400
Thomas Anderson <tanderson at orderamidchaos.com> wrote:

> On 10/25/2010 12:51 PM, Doug wrote:
> > I am new to Bogofilter. Had been using Spam Assassin for years and
> > wanted to try

You might want to try scoring Bogofilter into SpamAssassin, setting it
for multiple word tokenization, so that it complements SpamAssassin's
Bayes. I find that although Bogofilter (multiple word) and Bayes both
occasionally give false positives, they tend to do so on different
mails, so it's legitimate to have both.

I do something similar (I use DSPAM too), and I find that the mails
Bogofilter classifies as unsure usually pick up enough SA points to get
caught easily.

> > My problem is the unsures are not going down and the majority of
> > them have Viagra in the subject. It is not obfuscated in any way.
> > I see few if any Viagra emails in the spam mail.  Am I doing
> > something wrong? I have probably fed several hundred or more
> > unsures like this so far. Is there a way to strongly add a word?
> 
> I recommend training to exhaustion.  That is, when a false positive, 
> false negative, or unsure shows up, first you train it, then you
> check it again as if the same exact email arrived another time, and
> if it still doesn't classify correctly, train it again -- repeat
> until it classifies correctly.
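
For reference, the train-then-recheck loop described in the quoted
advice can be sketched roughly as follows. This is only an illustration,
not anything shipped with Bogofilter: the function names and the
max_rounds cap are made up for the example. It assumes the bogofilter
binary is on the PATH, that it reads the message on stdin, that -s and
-n register a message as spam and ham, and that the exit code is 0 for
spam, 1 for ham and 2 for unsure.

```python
import subprocess

# Assumed bogofilter exit codes: 0 = spam, 1 = ham, 2 = unsure.
LABELS = {0: "spam", 1: "ham", 2: "unsure"}


def classify(message: bytes) -> str:
    """Classify one message by piping it to bogofilter on stdin."""
    result = subprocess.run(["bogofilter"], input=message)
    if result.returncode not in LABELS:
        raise RuntimeError(f"bogofilter exited with {result.returncode}")
    return LABELS[result.returncode]


def train(message: bytes, label: str) -> None:
    """Register one message as spam (-s) or ham (-n)."""
    flag = "-s" if label == "spam" else "-n"
    subprocess.run(["bogofilter", flag], input=message, check=True)


def train_to_exhaustion(message, label, classify=classify, train=train,
                        max_rounds=10):
    """Train, then re-check, until the message classifies as `label`.

    Returns the number of training rounds used. `max_rounds` is a
    hypothetical safety cap so the loop cannot run forever.
    """
    for rounds in range(1, max_rounds + 1):
        train(message, label)
        if classify(message) == label:
            return rounds
    raise RuntimeError(f"still misclassified after {max_rounds} rounds")
```

Called as train_to_exhaustion(open("msg.eml", "rb").read(), "spam"), it
returns how many times the message had to be registered. Note that each
round registers the same message again, so its tokens are counted more
than once in the wordlist.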

In my experience that's ineffective with default settings because the
influence of new hapaxes and low-count tokens virtually guarantees
correct identification on the second test - unless you use a very large
value of "robs" that would be unsuitable for normal classification. It
makes more difference if you do it iteratively on corpora.
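
To illustrate the point about hapaxes and "robs", here is a small
sketch. It assumes the token score follows Robinson's smoothing,
f(w) = (robs * robx + n * p(w)) / (robs + n), where n is the number of
times the token has been seen and p(w) is its raw spam probability. The
values robs = 0.0178 and robx = 0.52 are assumed defaults and should be
checked against your own bogofilter.cf; robs = 1.0 is just an arbitrary
"very large" value chosen for comparison.

```python
def robinson_f(p, n, robs, robx=0.52):
    """Smoothed token score: pulls p toward robx with strength robs."""
    return (robs * robx + n * p) / (robs + n)


# A hapax seen once, only in spam: raw probability p = 1.0, count n = 1.
hapax_default = robinson_f(p=1.0, n=1, robs=0.0178)  # assumed default robs
hapax_large = robinson_f(p=1.0, n=1, robs=1.0)       # arbitrary large robs

print(f"{hapax_default:.3f}")  # 0.992
print(f"{hapax_large:.3f}")    # 0.760
```

With the small robs a single training pass pushes every new token in
the message to nearly 1.0, which is why the re-check almost always
passes on the first try. Only with a much larger robs does a token seen
once stay close enough to robx that a second round could matter.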


