Result Based on a Single Token

RW fbsd06 at mlists.homeunix.com
Tue Oct 2 19:47:48 CEST 2007


On Tue, 2 Oct 2007 18:03:49 +0100
John G Walker <johngeoffreywalker at yahoo.co.uk> wrote:

> 
> 
> On Tue, 2 Oct 2007 17:35:07 +0100 RW <fbsd06 at mlists.homeunix.com>
> wrote:
> 
> > The reason why this particular mail was detected as spam is that I
> > don't train on all unsure results in mailing lists.
> 
> That's your problem, then.
> 
> > The reason I don't learn all unsure mails in mailing lists is that
> > mailing lists are one of the few cases where spammers have access to
> > high-quality ham text, and I'm concerned that one day they may
> > exploit that. Consequently I don't like to let lists dominate my ham
> > corpus. 
> 
> If you try to pick and choose which observations go into a Bayesian
> (or, indeed, classical statistics) database then you get screwy
> results.


Personally, I believe this is a bug, and the details of how I triggered
it are immaterial.  Software should be tolerant of misuse, and should
fail safe.

> That's the nature of statistics. You have to throw in everything or it
> doesn't work. Period.
> 

And a Bayesian spam filter should not be capable of designating an
email as spam based on a single token. Period.
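
To make that concrete, here is a minimal sketch, assuming Robinson-Fisher
style combining of per-token spam probabilities. It is not bogofilter's
actual code; the function names, the MIN_TOKENS guard and the 0.5
"unsure" fallback are all hypothetical and only illustrate the point.
With exactly one token, the inverse chi-square terms collapse, and the
message score is simply that token's own probability, so one extreme
token can carry the message past a spam cutoff by itself.

```python
import math

def chi2q(x2, dof):
    """Upper-tail chi-square probability; valid for even dof only."""
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def fisher_score(probs):
    """Combine per-token spam probabilities (each strictly between 0 and 1)."""
    n = len(probs)
    # h is close to 1 when the tokens look spammy, s when they look hammy.
    h = chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    s = chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    return (1.0 + h - s) / 2.0

# Hypothetical guard, not a real bogofilter option: with too little
# evidence, report "unsure" (0.5) rather than a confident verdict.
MIN_TOKENS = 3

def guarded_score(probs, min_tokens=MIN_TOKENS):
    if len(probs) < min_tokens:
        return 0.5
    return fisher_score(probs)

# One token with probability 0.999: the unguarded score is that same value.
single = fisher_score([0.999])
```

With the guard in place, the same one-token message comes out as unsure,
which is the fail-safe behaviour I am arguing for; the threshold of 3 is
an arbitrary choice for the sketch.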





More information about the Bogofilter mailing list