Result Based on a Single Token

Tue Oct 2 13:03:26 CEST 2007

On Tue, 2 Oct 2007 04:35:21 +0100
RW wrote:

> On Mon, 01 Oct 2007 19:42:00 -0400
> Thomas Anderson <tanderso at oac-design.com> wrote:
> 
> > On Mon, 2007-10-01 at 22:38 +0100, RW wrote:
> > > I just noticed an email to a mailing list where it seems that a
> > > very high spam probability was based on a single token that had
> > > only been seen twice.
> > > 
> > > The filtering was done by Tuffmail, so I don't know any details
> > > about the version or configuration of Bogofilter.
> > > 
> > > Is this normal behaviour? It seems a bit reckless to me.
> > > 
> > > 
> >
> > Looks to me like a case of weak training and untuned config.
> > Firstly, train on errors.  Then, adjust your robx, robs, min_dev,
> > spam_cutoff, and ham_cutoff.  Letting a statistical filter screen
> > your messages without knowing anything about the filter seems a bit
> > reckless to me. Bogofilter will only do what you tell it to,
> > including what to consider and what to ignore statistically via
> > cutoffs and ranges.
> 
> Like I said, it's not my Bogofilter, it's the filter at
> http://www.tuffmail.com. I've trained on error/unknown for a year, and
> it's had about 250 ham + 350 spam.
> 
> Irrespective of how Tuffmail has tuned it, it seems fundamentally
> wrong that any Bayesian spam filter can produce an output of 0.996
> based on a single token. Shouldn't there be some kind of sanity
> check? 

Hello RW,

Looking at the "-vvv" output, it's apparent that Tuffmail has configured
bogofilter to ignore tokens with scores between 0.164021 and 0.196600.
This seems a bit extreme.
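
As for why one token can be enough: here is a rough Python sketch of
the Robinson-Fisher calculation as I understand it, not bogofilter's
actual code. The robx, robs and min_dev values are what I believe the
stock defaults to be, not Tuffmail's settings, and I am assuming the
token was seen twice in spam and never in ham. The function names are
made up for the example.

```python
import math

def token_score(spam_hits, ham_hits, n_spam, n_ham, robx=0.52, robs=0.0178):
    """Robinson's f(w): pull the raw estimate p(w) toward robx with weight robs."""
    n = spam_hits + ham_hits
    if n == 0:
        return robx  # unseen token: fall back to the prior
    spam_rate = spam_hits / n_spam
    ham_rate = ham_hits / n_ham
    p = spam_rate / (spam_rate + ham_rate)
    return (robs * robx + n * p) / (robs + n)

def chi2_sf_even(x, dof):
    """Chi-square survival function, valid for even degrees of freedom."""
    m = x / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def fisher_score(token_scores, min_dev=0.375):
    """Combine token scores with Fisher's method, skipping tokens near 0.5."""
    kept = [f for f in token_scores if abs(f - 0.5) >= min_dev]
    if not kept:
        return 0.5  # neutral value chosen for this sketch only
    dof = 2 * len(kept)
    p_val = chi2_sf_even(-2.0 * sum(math.log(1.0 - f) for f in kept), dof)
    q_val = chi2_sf_even(-2.0 * sum(math.log(f) for f in kept), dof)
    return (1.0 + q_val - p_val) / 2.0

# Assumed counts: token seen twice in spam, never in ham;
# 350 spam and 250 ham messages trained, as mentioned in the thread.
f = token_score(spam_hits=2, ham_hits=0, n_spam=350, n_ham=250)
print(round(f, 3), round(fisher_score([f]), 3))
```

With only one token left after the min_dev cut, the chi-square step
has two degrees of freedom and the combined score collapses to f(w)
itself. A small robs means two sightings already push f(w) to roughly
0.996, so raising robs is the obvious knob if you want rarely seen
tokens to count for less.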

Many email clients allow client-side filtering, i.e. you can run
bogofilter with _your_ parameters.  If I'm reading the mail headers
correctly, you are running Claws Mail 3.0.0 on a FreeBSD machine.
I know that Claws Mail allows the client-side filtering described
above.  Why not give it a try?

Regards,

David