Result Based on a Single Token

Tue Oct 2 18:35:07 CEST 2007

On Tue, 02 Oct 2007 11:07:47 -0400
Tom Anderson <tanderso at oac-design.com> wrote:

> RW wrote:
> > There are lots of reasons why I prefer having my mail on a remote
> > server, and my spam-filtering server-side, but I don't really want
> > to get into that. Tuffmail's Bogofilter actually does work very
> > well, and the overwhelming majority of spams/hams are identified on
> > a large number of tokens.
> > 
> > What bothers me about this is that it's an avoidable
> > false-positive. It arises because the edge-case isn't handled, in
> > which the token-selection rules don't leave enough information for
> > a sensible result.
> > 
> > IMO there are two sensible ways of handling this; either tinker with
> > the token selection so it always produces a minimum number of
> > tokens, or simply return 0.5 when there isn't enough information to
> > work with.
> 

> There aren't nearly enough 
> emails in the database to give "enough information".  If it were
> trained better, a single errant token would in fact produce a very
> neutral result.  Adjusting your cutoffs and min-dev range would also
> affect this.  I know you don't have direct access to this system, but
> it is still the truth of the matter.  Contact your administrator, or
> run your own server, or just accept the occasional false-positive.  I
> see no reason why bogofilter ought to be modified to account for a
> case that shouldn't exist.  

My point is that this is an edge case that Bogofilter is not
handling properly. I'm surprised that anyone would be happy simply to
reduce the probability of hitting the bug. YMMV.

The reason why this particular mail was detected as spam is that I
don't train on all unsure results in mailing lists. So this one-line
message had a huge number of header-based ham tokens at around 0.2 and
one 0.995766 spam token from the subject, and only the latter token
was selected. The reason I don't learn all unsure mails in mailing
lists is that mailing lists are one of the few cases where spammers
have access to high-quality ham text, and I'm concerned that one day
they may exploit that. Consequently I don't like to let lists dominate
my ham corpus. 
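
The effect described above can be reproduced with a rough sketch. It is
not Bogofilter's code. It assumes a min-dev of 0.375, a Robinson-Fisher
style chi-square combination, and 30 ham tokens standing in for the
"huge number" in the headers. The MIN_TOKENS guard is hypothetical: it
only illustrates the "return 0.5 when there isn't enough information"
proposal and is not an existing Bogofilter option.

```python
import math

MIN_DEV = 0.375   # assumed min-dev: tokens closer than this to 0.5 are ignored
MIN_TOKENS = 2    # hypothetical guard, not an existing Bogofilter option


def chi2_sf_even(x, dof):
    """Chi-square survival function, valid for even degrees of freedom."""
    m = x / 2.0
    term = total = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def fisher_score(probs):
    """Combine token probabilities Robinson-Fisher style (1.0 = spam)."""
    n = len(probs)
    h = chi2_sf_even(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    s = chi2_sf_even(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    return (1.0 + h - s) / 2.0


def select(probs, min_dev=MIN_DEV):
    """Keep only tokens at least min_dev away from the neutral 0.5."""
    return [p for p in probs if abs(p - 0.5) >= min_dev]


def classify(probs, min_tokens=MIN_TOKENS):
    """Score a message, falling back to 0.5 when too few tokens survive."""
    chosen = select(probs)
    if len(chosen) < min_tokens:
        return 0.5
    return fisher_score(chosen)


# 30 header tokens at 0.2 (illustrative count) plus the one subject token.
tokens = [0.2] * 30 + [0.995766]

selected = select(tokens)           # only the subject token survives
current = fisher_score(selected)    # a lone token's score is its own value
guarded = classify(tokens)          # neutral result under the proposed guard
unfiltered = fisher_score(tokens)   # all tokens together lean strongly to ham
```

With a single surviving token the combined score collapses to that
token's own probability, so the 0.2 tokens have no say at all, even
though scoring every token together would put the message well on the
ham side.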

You say that there aren't nearly enough emails in the database, but as
far as my personal mail is concerned, I've put every single
unsure/mislabeled email into that corpus for a year (except GIF
spams). It may take me 20 years to accumulate what you would call
enough. I don't think it's unreasonable to expect Bogofilter to behave
gracefully in those initial decades of training.