no filtering without ham

David Relson relson at osagesoftware.com
Sat Jan 17 15:26:30 CET 2009


On Sat, 17 Jan 2009 13:47:30 +0100
Sven Burmeister wrote:

> Hello everybody!
> 
> I use bogofilter with kmail and it works very well. However, there is
> an issue that comes up quite frequently with people who start using
> kmail+bogofilter for the first time.
> 
> The issue is that bogofilter seems to only start marking as spam
> after the user has marked some emails as ham.
> 
> From the questions I answered for new users, their workflow is
> something like this:
> 
> - Use kmail's anti-spam wizard to set up filters. This means that
> spam will be moved into a folder other than the inbox.
> - Wait for spam and mark it as such to train bogofilter.
> 
> They do not mark emails as ham, because there is no need to, since
> they were not moved into the spam folder, i.e. for them, as long as
> it is not falsely marked as spam, there is no need to tell bogofilter
> that it did something wrong.
> 
> As a result, bogofilter never starts to sort out spam emails, no
> matter how long they train it by marking spam emails. Hence for them
> bogofilter+kmail do not work.
> 
> There are several ways to work around this issue.
> 
> - Kmail could pipe some bogus mails through bogofilter as ham, to
> feed the wordlist.
> - Kmail could mark all incoming mails as ham.
> - Kmail could display a message when setting up filters that tells the
> user to mark emails as both spam and ham, since otherwise bogofilter
> will not work.
> - bogofilter could start working once it has been trained with spam
> emails only, and gather the ham statistics later, when users mark
> false positives.
> 
> Before I talk to the kmail devs, are you aware of this issue and what 
> "solution" would you prefer? Maybe you have an even better approach,
> so feel free to comment and add.
> 
> Sven

Hello Sven,

"No filtering without ham" is exactly right!  A Bayesian filter's job
is distinguishing "good" from "bad" and it does this by comparing a new
message to what it has been told "good" and "bad" mean.  Without
knowledge of _both_ good and bad, it cannot do its job.
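
To make the point concrete, here is a minimal sketch of a Robinson-style
token score.  It is an illustration only, not bogofilter's actual code;
the function name token_spam_prob and the parameters s and x (prior
strength and neutral prior) are assumptions chosen for this example.

```python
def token_spam_prob(spam_hits, ham_hits, n_spam, n_ham, s=1.0, x=0.5):
    """Smoothed estimate that a token indicates spam (illustrative only).

    spam_hits / ham_hits: number of trained spam / ham messages containing
    the token.  n_spam / n_ham: total trained spam / ham messages.
    s and x are hypothetical smoothing parameters for this sketch.

    Returns None when either class has no training messages, because the
    two per-class frequencies cannot be compared.
    """
    if n_spam == 0 or n_ham == 0:
        return None  # nothing to compare against: no classification

    n = spam_hits + ham_hits
    if n == 0:
        return x  # token never seen: fall back to the neutral prior

    spam_freq = spam_hits / n_spam
    ham_freq = ham_hits / n_ham
    p = spam_freq / (spam_freq + ham_freq)

    # Pull the raw ratio toward the neutral prior when evidence is thin.
    return (s * x + n * p) / (s + n)


# Spam-only training: the token cannot be scored at all.
print(token_spam_prob(spam_hits=5, ham_hits=0, n_spam=10, n_ham=0))

# Once some ham has been registered, the same token gets a real score.
print(token_spam_prob(spam_hits=5, ham_hits=0, n_spam=10, n_ham=10))
```

With no ham registered the sketch declines to score the token; after
even a small amount of ham the same counts yield a usable estimate.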

Your strategies are, generally, reasonable.  I don't like the "pipe
bogus mail" strategy, but the others are fine -- assuming the user is
informed of the need to deal with classification errors -- of which
there will be plenty until bogofilter has a reasonable wordlist.

Regards,

David


