Result Based on a Single Token

Tue Oct 2 16:56:27 CEST 2007

On Tue, 2 Oct 2007 07:03:26 -0400
David Relson <relson at osagesoftware.com> wrote:

> On Tue, 2 Oct 2007 04:35:21 +0100
> RW wrote:
> > Like I said, it's not my Bogofilter, it's the filter at
> > http://www.tuffmail.com. I've trained on error/unknown for a year,
> > and it's had about 250 ham + 350 spam.
> > 
> > Irrespective of how Tuffmail has tuned it, it seems fundamentally
> > wrong that any Bayesian spam filter can produce an output of 0.996
> > based on a single token. Shouldn't there be some kind of sanity
> > check? 
> 
> Hello RW,
> 
> Looking at the "-vvv" output, it's apparent that Tuffmail has configured
> bogofilter to ignore tokens with scores between 0.164021 and 0.196600.
> This seems a bit extreme.
> 
> Many email clients allow client-side filtering, i.e. you can run
> bogofilter with _your_ parameters.  If I'm reading the mail headers
> correctly, you are running Claws Mail 3.0.0 on a FreeBSD machine.
> I know that Claws Mail supports this kind of client-side filtering.
> Why not give it a try?
>
There are lots of reasons why I prefer having my mail on a remote
server, and my spam-filtering server-side, but I don't really want to
get into that. Tuffmail's Bogofilter actually does work very well, and
the overwhelming majority of spams/hams are identified on a large number
of tokens.

What bothers me about this is that it's an avoidable false positive.
It arises because one edge case isn't handled: the case in which the
token-selection rules don't leave enough information for a sensible
result.

IMO there are two sensible ways of handling this: either tinker with
the token selection so it always produces a minimum number of tokens,
or simply return 0.5 when there isn't enough information to work with.
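
To make the two options concrete, here is a rough Python sketch. It is
not Bogofilter's code. It assumes a Robinson-Fisher style combiner, and
the names select_tokens, score_with_fallback and score_with_padding, as
well as the min_dev and min_tokens defaults, are made up purely for
illustration. With this combiner a lone token whose probability is
0.996 scores about 0.996, which is the behaviour I'm complaining about.

```python
import math


def chi2q(x2, dof):
    """Chi-square survival function for an even number of degrees of freedom."""
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def fisher_score(probs):
    """Combine per-token spam probabilities; results near 1.0 mean spam."""
    # Clamp so that log() never sees 0 (hypothetical safety margin).
    probs = [min(max(f, 1e-9), 1.0 - 1e-9) for f in probs]
    n = len(probs)
    q = chi2q(-2.0 * sum(math.log(f) for f in probs), 2 * n)
    p = chi2q(-2.0 * sum(math.log(1.0 - f) for f in probs), 2 * n)
    return (1.0 + q - p) / 2.0


def select_tokens(probs, min_dev):
    """Keep only tokens at least min_dev away from the neutral 0.5."""
    return [f for f in probs if abs(f - 0.5) >= min_dev]


def score_with_fallback(probs, min_dev=0.1, min_tokens=5):
    """Option 2: report 'unsure' (0.5) when too few tokens survive selection."""
    chosen = select_tokens(probs, min_dev)
    if len(chosen) < min_tokens:
        return 0.5
    return fisher_score(chosen)


def score_with_padding(probs, min_dev=0.1, min_tokens=5):
    """Option 1: relax the selection until min_tokens tokens are used."""
    chosen = select_tokens(probs, min_dev)
    if len(chosen) < min_tokens:
        # Fall back to the most extreme tokens, ignoring min_dev.
        ranked = sorted(probs, key=lambda f: abs(f - 0.5), reverse=True)
        chosen = ranked[:min_tokens]
    if not chosen:
        return 0.5
    return fisher_score(chosen)


if __name__ == "__main__":
    message = [0.996, 0.5, 0.52, 0.48, 0.51]  # one strong token, rest neutral
    print(fisher_score(select_tokens(message, 0.1)))  # single-token result
    print(score_with_fallback(message))
    print(score_with_padding(message))
```

Either variant keeps one strong token from deciding the result alone.
The fallback is the simpler change; the padding variant still lets the
strong token count, but dilutes it with the near-neutral ones.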