Result Based on a Single Token

RW fbsd06 at mlists.homeunix.com
Tue Oct 2 05:35:21 CEST 2007


On Mon, 01 Oct 2007 19:42:00 -0400
Thomas Anderson <tanderso at oac-design.com> wrote:

> On Mon, 2007-10-01 at 22:38 +0100, RW wrote:
> > I just noticed an email to a mailing list where it seems that a very
> > high spam probability was based on a single token that had only been
> > seen twice.
> > 
> > The filtering was done by Tuffmail, so I don't know any details about
> > the version or configuration of Bogofilter.
> > 
> > Is this normal behaviour? It seems a bit reckless to me.
> > 
> > 
>
> Looks to me like a case of weak training and untuned config.  Firstly,
> train on errors.  Then, adjust your robx, robs, min_dev, spam_cutoff,
> and ham_cutoff.  Letting a statistical filter screen your messages
> without knowing anything about the filter seems a bit reckless to me.
> Bogofilter will only do what you tell it to, including what to
> consider and what to ignore statistically via cutoffs and ranges.

Like I said, it's not my Bogofilter, it's the filter at
http://www.tuffmail.com. I've trained on error/unknown for a year, and
it's had about 250 ham + 350 spam.

Irrespective of how Tuffmail has tuned it, it seems fundamentally wrong
that any Bayesian spam filter can produce an output of 0.996 based on a
single token. Shouldn't there be some kind of sanity check? 
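
To make the concern concrete, here is a rough Python sketch of the
Robinson-Fisher calculation that Bogofilter is documented to use. It is not
Bogofilter's actual code. The function names are hypothetical, the robx=0.52
and robs=0.0178 values are assumed (I believe they are the stock defaults, but
Tuffmail's real settings are unknown), and the message counts are just the
approximate training totals mentioned above.

```python
import math

def token_score(spam_count, ham_count, n_spam_msgs, n_ham_msgs,
                robx=0.52, robs=0.0178):
    """Robinson's smoothed f(w) for one token (robx/robs values are assumed)."""
    n = spam_count + ham_count
    if n == 0:
        return robx  # never-seen token falls back to the prior
    spam_rate = spam_count / n_spam_msgs
    ham_rate = ham_count / n_ham_msgs
    p = spam_rate / (spam_rate + ham_rate)  # raw spamminess of the token
    # Pull p towards robx with strength robs; a tiny robs means little pull.
    return (robs * robx + n * p) / (robs + n)

def chi2_sf_even(x, dof):
    """Chi-square survival function, closed form valid for even dof only."""
    m = x / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def fisher_score(scores):
    """Combine per-token f(w) values (non-empty list) with Fisher's method."""
    dof = 2 * len(scores)
    # p is small when the tokens look spammy, q is small when they look hammy.
    p = chi2_sf_even(-2.0 * sum(math.log(1.0 - f) for f in scores), dof)
    q = chi2_sf_even(-2.0 * sum(math.log(f) for f in scores), dof)
    return (1.0 + q - p) / 2.0

# One token, seen twice, both times in spam (hypothetical corpus sizes).
f = token_score(spam_count=2, ham_count=0, n_spam_msgs=350, n_ham_msgs=250)
print(round(f, 3), round(fisher_score([f]), 3))
```

Under those assumptions f(w) comes out at about 0.996, and with only one
token in play the Fisher combination collapses to f(w) itself, so the message
score is also about 0.996. In this sketch the only thing damping a
twice-seen token is robs, and a value that small hardly damps it at all.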



