no filtering without ham

Matthias Andree matthias.andree at gmx.de
Sun Jan 18 04:20:42 CET 2009


On Sat, 17 Jan 2009, Sven Burmeister wrote:

> I use bogofilter with kmail and it works very well. However there is an issue 
> that comes up quite frequently with people who start using kmail+bogofilter 
> for the first time.
> 
> The issue is that bogofilter seems to only start marking as spam after the 
> user has marked some emails as ham.

Right. See David's message.

> From the questions I answered for new users, their workflow is something like 
> this:
> 
> - Use kmail's anti-spam wizard to set-up filters. This means that spam will be 
> moved into some other folder than the inbox.
> - Wait for spam and mark it as such to train bogofilter.
> 
> They do not mark emails as ham, because there is no need to, since they were 
> not moved into the spam folder, i.e. for them, as long as it is not falsely 
> marked as spam, there is no need to tell bogofilter that it did something 
> wrong.

So another suggestion would be to set KMail up in a different way:

a - let kmail use bogofilter in three-state mode (ham/spam/unsure) -
  that's the default anyways

b - then have kmail file unsure messages into an unsure folder

c - make sure that after training "ham", kmail re-runs all other filters
so that messages end up in the right folder -- so as not to piss off
users who have a large stack of custom filters.

That way, people have an incentive to register ham, because messages end
up in "unsure".
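
To illustrate the three-state setup in (a) to (c), here is a minimal
Python sketch of the routing decision. The cutoff values, the boundary
comparisons and the folder names are assumptions made for the example,
not settings taken from bogofilter or KMail.

```python
from enum import Enum


class Verdict(Enum):
    HAM = "ham"
    SPAM = "spam"
    UNSURE = "unsure"


def classify(spamicity, ham_cutoff=0.45, spam_cutoff=0.99):
    """Map a spamicity score in [0, 1] to one of three states.

    The cutoffs are illustrative assumptions; a real setup would use
    whatever the filter's configuration specifies.
    """
    if spamicity >= spam_cutoff:
        return Verdict.SPAM
    if spamicity <= ham_cutoff:
        return Verdict.HAM
    return Verdict.UNSURE


def route(verdict):
    """Return a hypothetical action for the mail client to take."""
    if verdict is Verdict.SPAM:
        return "move to folder 'spam'"
    if verdict is Verdict.UNSURE:
        return "move to folder 'unsure'"
    # Ham (including mail just trained as ham) continues through the
    # user's remaining filters, as suggested in (c).
    return "run remaining filters"


if __name__ == "__main__":
    for score in (0.02, 0.70, 0.995):
        print(score, route(classify(score)))
```

With no ham registered, a setup like this would leave the user with a
growing "unsure" folder, which is the incentive described above.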

> - Kmail could pipe some bogus mails through bogofilter as ham, to feed the 
> wordlist.

I'd discourage that. It unnecessarily pollutes the database.

> - Kmail could mark all incoming mails as ham

Possible, but unless users exhaustively mark EVERY spam as such,
including in other folders they might read less often, you again end up
with a polluted database that turns out to be useless.

> - Kmail could display a message when setting up filters that tell the user to 
> mark emails as spam and ham, since otherwise bogofilter will not work.

That's a rather good idea. Kmail might want to display this warning as
long as either the absolute spam or ham count is low, and/or if there is
a considerable mismatch between these counts, say, lower than 1:5 or
higher than 5:1.
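
A minimal Python sketch of that check, assuming the client can obtain
the number of registered spam and ham messages. The function name and
the `min_count` threshold of 50 are hypothetical; only the 1:5 and 5:1
bounds come from the suggestion above.

```python
def needs_training_warning(spam_count, ham_count, min_count=50, max_ratio=5):
    """Decide whether to show the "please train spam AND ham" warning.

    min_count is an assumed value for "low"; max_ratio=5 corresponds
    to the 1:5 and 5:1 bounds.
    """
    # Too few messages of either kind registered so far.
    if spam_count < min_count or ham_count < min_count:
        return True
    # spam:ham lower than 1:5, or higher than 5:1.
    if spam_count * max_ratio < ham_count:
        return True
    if spam_count > ham_count * max_ratio:
        return True
    return False


if __name__ == "__main__":
    print(needs_training_warning(200, 0))     # spam only: warn
    print(needs_training_warning(200, 180))   # balanced: no warning
    print(needs_training_warning(2000, 100))  # ratio above 5:1: warn
```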

> - bogofilter could start working after it was trained with spam emails only 
> and get the ham stats after that, when users mark false positives.

Conceivable (with a new command line option, we do not want to change
default behaviour), but I haven't thought about the consequences yet.
The implementation might just consider all tokens that bogofilter does
not know "ham" - but we'll probably need to tune the default parameters
in such a mode, and we might have to change the algorithm as well -- I
never looked too closely at it. AFAIR from the cursory glances I had at
the discussions, it's Bayesian estimation for each individual (or
compound, if so configured) token with a χ² (chi-square) test to make up
the overall spamicity.  I'm not sure this is all feasible or sensible.
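
For readers unfamiliar with that kind of scoring, here is a rough
Python sketch of a per-token estimate combined with a chi-square
(Fisher) test, in the style usually attributed to Gary Robinson. It is
not bogofilter's code; the function names and the parameters `s` and
`x` are assumptions chosen for the example.

```python
import math


def token_prob(spam_hits, ham_hits, spam_msgs, ham_msgs, s=1.0, x=0.5):
    """Estimate how spammy one token is, shrunk towards the prior x.

    s (strength of the prior) and x (score for unknown tokens) are
    illustrative values, not bogofilter defaults.
    """
    n = spam_hits + ham_hits
    if n == 0:
        return x  # token never seen: fall back to the prior
    bad = spam_hits / spam_msgs if spam_msgs else 0.0
    good = ham_hits / ham_msgs if ham_msgs else 0.0
    if bad + good == 0:
        return x  # inconsistent counts: fall back to the prior
    p = bad / (bad + good)
    return (s * x + n * p) / (s + n)


def chi2q(x2, v):
    """Probability that a chi-square variable with v degrees of
    freedom (v even) exceeds x2."""
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def spamicity(probs):
    """Combine token probabilities (each strictly between 0 and 1)."""
    n = len(probs)
    h = chi2q(-2.0 * sum(math.log(f) for f in probs), 2 * n)
    s = chi2q(-2.0 * sum(math.log(1.0 - f) for f in probs), 2 * n)
    return (1.0 + h - s) / 2.0


if __name__ == "__main__":
    print(spamicity([0.99] * 5))  # close to 1
    print(spamicity([0.01] * 5))  # close to 0
    print(spamicity([0.5] * 5))   # 0.5, i.e. no evidence either way
```

In this sketch a token seen only in spam still scores below 1 because
of the prior, and with no ham registered at all the per-token estimate
has nothing to compare against, which is one way to see why a
spam-only mode would need retuned parameters. For very long messages
`math.exp(-m)` can underflow; a real implementation would have to
handle that.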

HTH

Matthias

-- 
Matthias Andree