Result Based on a Single Token
Tom Anderson
tanderso at oac-design.com
Tue Oct 2 17:07:47 CEST 2007
RW wrote:
> There are lots of reasons why I prefer having my mail on a remote
> server, and my spam-filtering server-side, but I don't really want to
> get into that. Tuffmail's Bogofilter actually does work very well, and
> the overwhelming majority of spams/hams are identified on a large number
> of tokens.
>
> What bothers me about this is that it's an avoidable false-positive.
> It arises because the edge-case isn't handled, in which the
> token-selection rules don't leave enough information for a sensible
> result.
>
> IMO there are two sensible ways of handling this; either tinker with
> the token selection so it always produces a minimum number of tokens,
> or simply return 0.5 when there isn't enough information to work with.
The filter needs to be trained on errors. There aren't nearly enough
emails in the database to give "enough information". If it were trained
better, a single errant token would in fact produce a very neutral
result. Adjusting your spam/ham cutoffs and min_dev setting would also affect
this. I know you don't have direct access to this system, but it is
still the truth of the matter. Contact your administrator, or run your
own server, or just accept the occasional false-positive. I see no
reason why bogofilter ought to be modified to account for a case that
shouldn't exist. The wordlist MUST be a constantly evolving set, not
something that is just trained once and then ignored. Statistical
filtering simply doesn't work that way.
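
To make the single-token case concrete, here is a minimal Python sketch,
assuming Robinson-style smoothing of per-token probabilities and Fisher
(chi-square) combining as those methods are commonly described. It is not
bogofilter's actual code. The function names, the min_tokens fallback
(RW's second suggestion), the message counts, and the robs/robx/min_dev
values are all assumptions chosen for the example.

```python
import math

def token_prob(spam_count, ham_count, n_spam_msgs, n_ham_msgs,
               robs=0.0178, robx=0.52):
    """Smoothed spam probability f(w) for one token (illustrative values)."""
    n = spam_count + ham_count
    if n == 0:
        return robx  # unseen token: fall back to the prior robx
    spam_rate = spam_count / n_spam_msgs
    ham_rate = ham_count / n_ham_msgs
    p = spam_rate / (spam_rate + ham_rate)
    # robs > 0 keeps f(w) strictly between 0 and 1, so the logs below are safe.
    return (robs * robx + n * p) / (robs + n)

def chi2q(x2, dof):
    """Chi-square survival function, valid for even degrees of freedom."""
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def score(probs, min_dev=0.375, min_tokens=0):
    """Combine token probabilities; min_tokens is a hypothetical safeguard."""
    # Keep only tokens at least min_dev away from the neutral value 0.5.
    kept = [f for f in probs if abs(f - 0.5) >= min_dev]
    if not kept or len(kept) < min_tokens:
        return 0.5  # not enough information: report a neutral score
    dof = 2 * len(kept)
    s = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - f) for f in kept), dof)
    h = 1.0 - chi2q(-2.0 * sum(math.log(f) for f in kept), dof)
    return (1.0 + s - h) / 2.0

# Sparse wordlist: the token was seen once, in spam only.
sparse = token_prob(1, 0, n_spam_msgs=100, n_ham_msgs=100)
# Better-trained wordlist: the same token has also turned up in ham.
trained = token_prob(5, 4, n_spam_msgs=100, n_ham_msgs=100)

print(score([sparse]))                # one token decides the whole message
print(score([trained]))               # token falls inside min_dev, so 0.5
print(score([sparse], min_tokens=2))  # RW's fallback, so 0.5
```

With only one token surviving the min_dev filter, the combined score in
this sketch is simply that token's smoothed probability, so a token seen
once in spam pushes the message to about 0.99 on its own. Once the same
token has been registered in ham as well, it lands inside the min_dev
band and is ignored.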
Tom