Bogofilter for general filesystem classification

Greg Louis glouis at dynamicro.on.ca
Sun Sep 14 13:41:23 CEST 2003


On 20030914 (Sun) at 1327:02 +1000, Ben Martin wrote:
> > recommend creating your test corpora and running bogotune (which is in
> > the bogofilter/tuning subdirectory) to determine parameters that fit
> > _your_ mix of messages (files).
> 
> Hmm, does bogotune do very bad things if there are less than 2000 spam
> and 2000 ham in the database?

[author leaps in here]

Depends whether you think returning worthless recommendations is a very
bad thing.  I'm not saying 1999 spam and 1999 nonspam give garbage
while you can bet your life on 2001 spam and 2001 nonspam.  I am saying
that the likelihood that bogotune's results are worth applying in
production rises with corpus size and approaches its maximum
asymptotically; as a rough guide, the 2,000 figure is probably somewhere
around the 80% point, with 10,000 around 98%.  As always, YMMV -- that's
why bogotune was written in the first place.
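
To get a feel for what such a curve looks like between and below those
two figures, here is a minimal sketch.  It assumes a shape I am making up
purely for illustration (log-odds linear in the log of the corpus size)
and forces it through the two rough points quoted above; no formula is
given in this thread, and the names fit_through and usefulness are
hypothetical.

```python
import math

def logit(p):
    """Log-odds of a proportion p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))

def fit_through(point1, point2):
    """Fit logit(y) = slope * ln(n) + intercept through two (n, y) points.

    ASSUMPTION: this functional form is an illustrative choice only; it is
    not something bogotune computes or documents.
    """
    (n1, y1), (n2, y2) = point1, point2
    slope = (logit(y2) - logit(y1)) / (math.log(n2) - math.log(n1))
    intercept = logit(y1) - slope * math.log(n1)
    return slope, intercept

def usefulness(n, slope, intercept):
    """Interpolated 'fraction of max' for n spam and n nonspam messages."""
    z = slope * math.log(n) + intercept
    return 1.0 / (1.0 + math.exp(-z))

# The two rough figures from the text: 2,000 -> ~80%, 10,000 -> ~98%.
slope, intercept = fit_through((2000, 0.80), (10000, 0.98))
for n in (500, 2000, 5000, 10000):
    print(n, round(usefulness(n, slope, intercept), 2))
```

Any other saturating curve through the same two points would serve as
well; the only part grounded in experience is the pair of anchor figures,
and even those are rough.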

> Obviously having a --allow-much-less-optimal-tuning option for bogotune
> would be the way so that the user knew that they held a large gun at
> their foot by using the option. Apart from that then I think the rough
> stab at values for cutoffs would be the only other option.

I have a useful little file called test20 -- it's simply an mbox with
20 spam and 20 nonspam.  It has two purposes: the main one is a quick
sanity check on a newly compiled bogofilter, but the other is precisely
to serve as a poor man's bogotune -- running bogofilter manually, with
much use of the -m and -o options, to get a quick rough idea of
appropriate parameter settings.  More often than not, one will end up
at a local maximum by tuning this way, but unless the values are
extreme (mindev > 0.44, for example), that maximum is probably "good
enough" -- and as good as you'd be likely to get by applying bogotune
to a tiny sample size.
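
The manual procedure above could be scripted roughly as follows.  This is
a sketch under stated assumptions, not how bogotune works: it assumes
bogofilter is on the PATH, that the first value given to -m is min_dev
and the first value given to -o is the spam cutoff, and that exit status
0 means "spam" (check bogofilter(1) for your version).  It also assumes
the messages have already been split out of the mbox into
(raw_bytes, is_spam) pairs.  The function names are hypothetical.

```python
import itertools
import subprocess

def bogofilter_verdict(message_bytes, min_dev, spam_cutoff):
    """Classify one message with bogofilter; return True if it said spam.

    ASSUMPTIONS: -m sets min_dev, -o sets the spam cutoff, and exit
    status 0 means spam.  Any other status (nonspam, unsure, error)
    counts as "not spam" here.
    """
    cmd = ["bogofilter", "-m", str(min_dev), "-o", str(spam_cutoff)]
    result = subprocess.run(cmd, input=message_bytes,
                            stdout=subprocess.DEVNULL)
    return result.returncode == 0

def count_errors(messages, verdict, min_dev, spam_cutoff):
    """Return (false_positives, false_negatives) for one parameter pair."""
    false_pos = false_neg = 0
    for body, is_spam in messages:
        said_spam = verdict(body, min_dev, spam_cutoff)
        if said_spam and not is_spam:
            false_pos += 1
        elif is_spam and not said_spam:
            false_neg += 1
    return false_pos, false_neg

def rough_tune(messages, min_devs, cutoffs, verdict=bogofilter_verdict):
    """Try every (min_dev, cutoff) pair on a small labelled sample.

    Prefers the fewest false positives, then the fewest false negatives;
    on ties the earliest pair tried wins.
    Returns ((false_pos, false_neg), min_dev, cutoff).
    """
    best = None
    for min_dev, cutoff in itertools.product(min_devs, cutoffs):
        errors = count_errors(messages, verdict, min_dev, cutoff)
        if best is None or errors < best[0]:
            best = (errors, min_dev, cutoff)
    return best
```

With only 40 messages many parameter pairs will tie, so the pair this
returns is a rough starting point in the "good enough" sense described
above, nothing more.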

-- 
| G r e g  L o u i s         | gpg public key: 0x400B1AA86D9E3E64 |
|  http://www.bgl.nu/~glouis |   (on my website or any keyserver) |
|  http://wecanstopspam.org in signatures helps fight junk email. |



