Graham's method seemed better

Boris 'pi' Piwinger 3.14 at logic.univie.ac.at
Sat Nov 23 12:17:31 CET 2002


Greg Louis <glouis at dynamicro.on.ca> wrote:

>We can conclude that it is unlikely to be true that Graham's
>calculation method misses fewer spams than Robinson's, or than the
>variant of Robinson's that is based on Fisher's method of combining
>probabilities.
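
For readers who do not have the two calculation methods in mind, here is a rough sketch of how Graham's and Robinson's formulas combine per-token spam probabilities into one message score. It is not bogofilter's code: the function names are made up for illustration, and token selection, probability clamping, and the Fisher variant are left out.

```python
import math

def graham(probs):
    """Graham's combination: prod(p) / (prod(p) + prod(1 - p)).

    Assumes the probabilities are clamped away from 0 and 1, so the
    denominator cannot become zero.
    """
    prod_p = math.prod(probs)
    prod_q = math.prod(1.0 - p for p in probs)
    return prod_p / (prod_p + prod_q)

def robinson(probs):
    """Robinson's geometric-mean combination, rescaled to [0, 1]."""
    n = len(probs)
    P = 1.0 - math.prod(1.0 - p for p in probs) ** (1.0 / n)
    Q = 1.0 - math.prod(probs) ** (1.0 / n)
    S = (P - Q) / (P + Q)
    return (1.0 + S) / 2.0

# Two tokens, each with spam probability 0.9 (made-up values).
example = [0.9, 0.9]
print(graham(example), robinson(example))
```

With the same inputs, Graham's formula moves toward 0 or 1 faster than Robinson's, so a cutoff that suits one method need not suit the other.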

I certainly did not understand the complete analysis. Maybe
I should elaborate on how I arrived at my impression:

I used the 0.8 release for the "in-work comparison". I took
the training set Greg now has and rebuilt the database from
scratch using -r (and I also used -r everywhere else I call
bogofilter). I use bogofilter as shown in my .procmailrc:
http://piology.org/.procmailrc.html

With this setup, some spam is not caught by bogofilter but
by the procmail rules that follow it. Other spam is missed
completely. I check everything manually and then add it to
the training set (for ham, only those messages received
during working hours, which should be statistically OK; all
spam is added). Probably also insignificant: very large
e-mails are not added to the ham collection.

For some days I was very annoyed by the poor performance:
the number of messages in the trash (caught by procmail
only) and, worse, the number of spams not caught at all
required quite a few manual corrections using -S.

So I removed the -r everywhere and again rebuilt the
database with the now slightly bigger training set (grown
over only a few days). I had (without being able to give
numbers) significantly less in the trash and almost no
uncaught spams. Very satisfying.

Among those not caught, base64-encoded messages show up a
lot.

The major difference regarding the training set is its
size: you used a third of the mails, while I used most of
the mails you have and "tested" only with the normal
workflow of incoming mails.

Even though I cannot provide numbers, the manual correction
work I now save is something I want to call strong evidence
of an effect. But as you know, I used bogofilter out of the
box and did not tweak anything.

I assume that MIME decoding would be *the* great improvement
to get me down to a situation where I have close to no
manual corrections.
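
To illustrate what I mean by MIME decoding, here is a minimal sketch using Python's standard email module. It is not how bogofilter does or would do it, and the function name is made up. The point is only that a base64 body is an opaque block of letters until it is decoded, so a tokenizer working on the raw message never sees the real words.

```python
import base64
from email import message_from_string

def decoded_text_parts(raw_message):
    """Return the decoded text of every text/* part of a raw message."""
    msg = message_from_string(raw_message)
    texts = []
    for part in msg.walk():
        # Skip multipart containers and non-text attachments.
        if part.get_content_maintype() != "text":
            continue
        # decode=True undoes base64 or quoted-printable transfer encoding.
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "latin-1"
        try:
            texts.append(payload.decode(charset, errors="replace"))
        except LookupError:
            # Unknown charset label: fall back to a byte-preserving decoding.
            texts.append(payload.decode("latin-1"))
    return texts

# Made-up sample message with a base64-encoded body.
sample = (
    'Content-Type: text/plain; charset="utf-8"\n'
    "Content-Transfer-Encoding: base64\n"
    "\n"
    + base64.b64encode(b"cheap pills now").decode("ascii")
    + "\n"
)
print(decoded_text_parts(sample))
```

The decoded strings, not the raw body, would then be handed to the tokenizer.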

pi

PS: Are any of the native speakers still shocked by the use
of a plural form of the word mail?


