no filtering without ham

Matthias Andree matthias.andree at gmx.de
Mon Jan 19 10:39:21 CET 2009


On Sun, 18 Jan 2009, jkinz at kinz.org wrote:

> We all know the problem of pre-training bogo. One man's ham is
> another man's spam, and this is likely to be insurmountable.

That's indeed the case. The paths along which I receive a lot of spam
(I have a @web.de address that is sort of burnt, with a spam:ham ratio in
excess of 1000:1) are usually reflected in the database (recorded as
Received and other headers). That helps me, but will be useless for others.

I also find that the characteristics of spam received at different
accounts vary, depending on the pre-filtering (if any) that providers do.

> Additionally the user education issue is likewise insurmountable.

:-)

> However these two insurmountables become trivial if your
> performance level requirement is changed from :
> 
> "Never do harm" [the current choice]
> 
> to
> 
> "Make some kind of reasonable choice of spam/ham and do our 
> best to inform the users about it and how they can override it"

I'd tend to think this is a matter of the user interface. That would be
KMail in this case. The advantage at least is that it can very easily
re-filter after training, and can also provide "unsure" ratings if KMail
is flexible enough.
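
To make the "unsure" rating concrete, here is a minimal Python sketch of how
a front-end could turn bogofilter's exit status into a three-way label. The
exit codes (0 = spam, 1 = ham, 2 = unsure, 3 = error) are the ones given in
the bogofilter man page; the function names are made up for illustration,
and "unsure" only shows up when tri-state classification is in effect.

```python
import subprocess

# Exit codes as documented in the bogofilter man page.
EXIT_LABELS = {0: "spam", 1: "ham", 2: "unsure"}


def label_for_exit_code(code):
    """Translate a bogofilter exit status into a label (hypothetical helper).

    Anything outside 0..2 (for example 3, an I/O or other error) is
    reported as "error".
    """
    return EXIT_LABELS.get(code, "error")


def classify(message_bytes, bogofilter="bogofilter"):
    """Pipe one message to bogofilter and return its label.

    Needs bogofilter on the PATH; after retraining, a front-end can
    simply call this again to re-filter the same message.
    """
    result = subprocess.run(
        [bogofilter], input=message_bytes, stdout=subprocess.DEVNULL
    )
    return label_for_exit_code(result.returncode)
```

A front-end that has no "unsure" folder could treat that label as ham and
still offer the message for training.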

When looking at user interfaces, I'd tend to look at Thunderbird. IMO
it is easy enough: it learns quickly initially and doesn't need the
"unsure" rating.

> My idea - add a second installation choice to the Bogo package.
> This one would come pre-trained with a select population of
> spam and ham. When installing this version the user is
> responsible for any retraining they need done. 

As alluded to above, and as your "one man's ham is another man's spam"
says, there is no standard spam corpus I'd consider useful. SpamAssassin or
other preconfigured filters can be useful for some initial training, though.

> This second choice could be done very simplistically by adding
> the spam and ham file sets as raw email collections, not bogo
> db's, and a script the user can run to train their bogo install 
> using them. 

Well, database dumps loaded by a script wrapped around bogoutil might also
be useful, and will certainly be faster than training with bogofilter.
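
As a rough sketch of what either script could look like, the Python below
builds the two kinds of command lines: training from raw mbox files with
bogofilter (-M for mbox input, -s to register spam, -n to register ham) and
loading a text dump with bogoutil -l. The flags are from the man pages; the
file names and function names are placeholders, not anything shipped with
bogofilter.

```python
import subprocess


def training_commands(spam_mbox, ham_mbox):
    """Return (argv, stdin_path) pairs that train from raw mbox files.

    -M makes bogofilter read its input as an mbox, -s registers the
    messages as spam and -n registers them as ham.
    """
    return [
        (["bogofilter", "-M", "-s"], spam_mbox),
        (["bogofilter", "-M", "-n"], ham_mbox),
    ]


def load_dump_command(wordlist_path):
    """Return the argv that loads a text dump (read from stdin) into a wordlist."""
    return ["bogoutil", "-l", wordlist_path]


def run_with_stdin(argv, input_path):
    """Run one command with the given file as standard input.

    Needs bogofilter/bogoutil installed; raises CalledProcessError on failure.
    """
    with open(input_path, "rb") as handle:
        subprocess.run(argv, stdin=handle, check=True)
```

A dump produced elsewhere with "bogoutil -d wordlist.db > dump.txt" could
then be loaded via run_with_stdin(load_dump_command("wordlist.db"),
"dump.txt"), which skips tokenizing every message again.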

However, there seems to be an enormous difference in how knowledgeable
users of different front-ends are. Those who set up bogofilter in
maildrop or procmail or Emacs-with-Gnus have some skills already and are
IMHO in a different class than those using bogofilter buried under
Evolution's or KMail's hood.

And I'd tend to claim that if a user interface promises user guidance
(such as easy integration by just clicking through a wizard), it is
supposed to take the user all the way and put up a red flag "not enough
[ham][&][spam] trained yet", rather than configure just the filter and
then leave the user alone. The very least would be to open the help file
at the right page...
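
A minimal sketch of such a red flag, in Python. It assumes the front-end
obtains the message counts from something like
"bogoutil -w ~/.bogofilter .MSG_COUNT" and that the relevant output line has
the form ".MSG_COUNT <spam> <good>"; that layout, the function names and the
threshold of 100 messages are all assumptions for illustration, not
recommended values.

```python
def parse_msg_count(line):
    """Parse a line of the assumed form '.MSG_COUNT <spam> <good>'."""
    fields = line.split()
    if len(fields) != 3 or fields[0] != ".MSG_COUNT":
        raise ValueError("unexpected .MSG_COUNT line: %r" % line)
    return int(fields[1]), int(fields[2])


def training_warning(spam, ham, minimum=100):
    """Return a warning string, or None when both counts reach the minimum.

    The minimum of 100 is an arbitrary placeholder.
    """
    missing = []
    if ham < minimum:
        missing.append("ham")
    if spam < minimum:
        missing.append("spam")
    if not missing:
        return None
    return "not enough " + " & ".join(missing) + " trained yet"
```

For a user who has only ever marked spam, training_warning(500, 0) yields
the "not enough ham trained yet" flag.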

> This keeps the original "pure" install intact unless the user
> runs that script.
> 
> Additionally it keeps the new install delivery very simple and
> allows the user a very simple "push one button" mechanism to
> get the benefit of it.
> 
> Another benefit - it acts as a nice smoke test.  If the training
> doesn't work, then the install is broken.

We have "make check", and it's been a long time since I've heard of broken
/installs/. Configuration might be a different issue, but we won't catch
that with this kind of smoke test.

-- 
Matthias Andree
