Result Based on a Single Token

Tue Oct 2 17:07:47 CEST 2007

RW wrote:
> There are lots of reasons why I prefer having my mail on a remote
> server, and my spam-filtering server-side, but I don't really want to
> get into that. Tuffmail's Bogofilter actually does work very well, and
> the overwhelming majority of spams/hams are identified on a large number
> of tokens.
> 
> What bothers me about this is that it's an avoidable false-positive.
> It arises because one edge case isn't handled: the one in which the
> token-selection rules don't leave enough information for a sensible
> result.
> 
> IMO there are two sensible ways of handling this; either tinker with
> the token selection so it always produces a minimum number of tokens,
> or simply return 0.5 when there isn't enough information to work with.

The filter needs to be trained on errors.  There aren't nearly enough 
emails in the database to give "enough information".  If it were trained 
better, a single errant token would in fact produce a very neutral 
result.  Adjusting your cutoffs and min-dev range would also affect 
this.  I know you don't have direct access to this system, but it is 
still the truth of the matter.  Contact your administrator, or run your 
own server, or just accept the occasional false-positive.  I see no 
reason why bogofilter ought to be modified to account for a case that 
shouldn't exist.  The wordlist MUST be a constantly evolving set, not 
something that is just trained once and then ignored.  Statistical 
filtering simply doesn't work that way.
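
To make the arithmetic concrete, here is a minimal sketch in Python (not
bogofilter's actual C code) of Robinson's f(w) and the Fisher combination.
The constants robs=0.0178, robx=0.52 and min_dev=0.375 are assumed to match
the usual defaults, and every message and token count below is made up for
illustration.  The function names are hypothetical as well.

```python
import math

# Assumed parameter values; check your own bogofilter.cf before relying on them.
ROBS = 0.0178    # weight given to the prior guess ROBX
ROBX = 0.52      # score assumed for a token never seen before
MIN_DEV = 0.375  # tokens closer than this to 0.5 are ignored


def token_prob(spam_count, ham_count, spam_msgs, ham_msgs):
    """Robinson's f(w) = (s*x + n*p(w)) / (s + n) for one token."""
    n = spam_count + ham_count
    if n == 0:
        return ROBX
    bad = spam_count / max(spam_msgs, 1)
    good = ham_count / max(ham_msgs, 1)
    p = bad / (bad + good)
    return (ROBS * ROBX + n * p) / (ROBS + n)


def chi2q(x, df):
    """Upper-tail chi-square probability for an even number of degrees of freedom."""
    m = x / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def score(probs, min_dev=MIN_DEV):
    """Fisher-style combination of the token probabilities that pass min_dev."""
    used = [f for f in probs if abs(f - 0.5) >= min_dev]
    if not used:
        # Sketch-only choice: call it neutral when nothing is left to combine.
        return 0.5
    df = 2 * len(used)
    h = chi2q(-2.0 * sum(math.log(f) for f in used), df)
    s = chi2q(-2.0 * sum(math.log(1.0 - f) for f in used), df)
    return (1.0 + h - s) / 2.0


if __name__ == "__main__":
    # Sparse database: token seen once, in spam only (50 spam / 50 ham messages).
    sparse = token_prob(1, 0, 50, 50)
    # Well-trained database: same token seen often in both kinds of mail.
    trained = token_prob(30, 25, 1000, 1000)
    print("sparse  f(w) =", sparse, "score =", score([sparse]))
    print("trained f(w) =", trained, "score =", score([trained], min_dev=0.0))
```

With exactly one token the combination collapses to f(w) itself, so under
these assumptions the sparse case lands just above 0.99 while the trained
case stays near 0.55.  With the min_dev filter left on, the trained token
would not be selected for scoring at all.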

Tom