Result Based on a Single Token
RW
fbsd06 at mlists.homeunix.com
Tue Oct 2 16:56:27 CEST 2007
On Tue, 2 Oct 2007 07:03:26 -0400
David Relson <relson at osagesoftware.com> wrote:
> On Tue, 2 Oct 2007 04:35:21 +0100
> RW wrote:
> > Like I said, it's not my Bogofilter, it's the filter at
> > http://www.tuffmail.com. I've trained on error/unknown for a year,
> > and it's had about 250 ham + 350 spam.
> >
> > Irrespective of how Tuffmail has tuned it, it seems fundamentally
> > wrong that any Bayesian spam filter can produce an output of 0.996
> > based on a single token. Shouldn't there be some kind of sanity
> > check?
>
> Hello RW,
>
> Looking at the "-vvv" output, it's apparent tuffmail has configured
> bogofilter to ignore tokens with scores between 0.164021 and 0.196600.
> This seems a bit extreme.
>
> Many email clients allow client side filtering, i.e. you can run
> bogofilter with _your_ parameters. If I'm reading the mail headers
> correctly, you are running Claws Mail 3.0.0 on a FreeBSD machine.
> I know that claws-mail allows the client side filtering above. Why
> not give it a try?
>
There are lots of reasons why I prefer having my mail on a remote
server, and my spam-filtering server-side, but I don't really want to
get into that. Tuffmail's Bogofilter actually does work very well, and
the overwhelming majority of spams/hams are identified on a large number
of tokens.

What bothers me about this is that it's an avoidable false positive.
It arises because an edge case isn't handled: the one in which the
token-selection rules don't leave enough information for a sensible
result.

IMO there are two sensible ways of handling this: either tinker with
the token selection so it always produces a minimum number of tokens,
or simply return 0.5 when there isn't enough information to work with.
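
To make the two options concrete, here is a rough Python sketch. It
assumes a Robinson-Fisher style chi-square combination of per-token
spam probabilities, which I believe is close to what Bogofilter does,
but it is not Bogofilter's actual code. The function names and the
min_dev and min_tokens values are made up for illustration. With a
single surviving token, this combination simply returns that token's
own probability, which is the behaviour being complained about.

```python
import math


def chi2q(x2, dof):
    """Chi-square survival function, valid for even degrees of freedom."""
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def fisher_score(probs):
    """Combine token spam probabilities (each strictly between 0 and 1)."""
    n = len(probs)
    s = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    h = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (1.0 + s - h) / 2.0


def score_or_unsure(probs, min_dev=0.3, min_tokens=5):
    """Option 2: return 0.5 when too few tokens survive selection.

    min_dev and min_tokens are hypothetical values, not Tuffmail's.
    """
    selected = [p for p in probs if abs(p - 0.5) >= min_dev]
    if len(selected) < min_tokens:
        return 0.5
    return fisher_score(selected)


def select_at_least(probs, min_dev=0.3, min_tokens=5):
    """Option 1: relax selection so a minimum number of tokens is used.

    If too few tokens pass the min_dev cut, fall back to the
    min_tokens tokens that deviate most from 0.5.
    """
    selected = [p for p in probs if abs(p - 0.5) >= min_dev]
    if len(selected) >= min_tokens:
        return selected
    ranked = sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)
    return ranked[:min_tokens]


# A lone token with probability 0.996 yields a score of about 0.996.
single = fisher_score([0.996])

# With the sanity check, the same message comes back as "unsure".
guarded = score_or_unsure([0.996, 0.5, 0.45])
```

Either approach stops one strongly spammy token from deciding the
result on its own. The first dilutes it with the next most informative
tokens, while the second declines to classify at all.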