tuning 0.10.1.1

Greg Louis glouis at dynamicro.on.ca
Sun Jan 26 19:41:31 CET 2003


This is a very informal report of a preliminary session tuning
bogofilter 0.10.1.1 with Robinson's f(w) calculation and Fisher's
method for combining probabilities (aka Robinson-Fisher).

The test corpus was a week's collected email to my home mail server.
I've been using a variant of bogofilter 0.8.0 and have been getting good
results with
  min_dev = 0.1
  spam_cutoff = 0.99
  ham_cutoff = 0.1
  robs = 5.0e-7
  robx = 0.415

This setup gave the following results with the test corpus: 3772
correctly identified nonspam, 135 (3.5%) unsure nonspam, 187 correctly
identified spam and 14 (6.9%) unsure spam.  In binary terms, therefore,
we had no false positives but delivered 14 spams.

I rebuilt my training database in another directory using 0.10.1.1, with
kill_html_comments enabled and datestamps disabled.  I didn't recalculate
robx, nor did I adjust the nonspam cutoff, but I did a quick tuning job
on the other three parameters with the aid of a tiny corpus consisting
of 20 spams and 20 spammy-looking nonspams.  What I came up with was
  min_dev = 0.25
  spam_cutoff = 0.985
  ham_cutoff = 0.1 (not tuned)
  robs = 0.0035
  robx = 0.415 (not tuned)
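
For readers unfamiliar with what these parameters control, here is a
minimal Python sketch of the Robinson-Fisher calculation as it is
usually described.  It is not bogofilter's actual code: the function
names are made up for illustration, and details such as the exact
comparisons at the cutoffs and the score given to a message with no
usable tokens are assumptions.

```python
import math

def f_w(p, n, robs, robx):
    """Robinson's f(w): pull the raw estimate p toward robx.

    p    -- raw spam probability of the token from the training counts
    n    -- number of training messages in which the token was seen
    robs -- weight given to the prior (Robinson's s)
    robx -- value assumed for a token never seen before (Robinson's x)
    """
    return (robs * robx + n * p) / (robs + n)

def chi2q(x2, dof):
    """Upper-tail chi-square probability; dof must be even."""
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def fisher_score(fw_values, min_dev):
    """Combine f(w) values (each strictly between 0 and 1).

    Returns a score near 0 for hammy and near 1 for spammy messages.
    """
    # Tokens closer than min_dev to the neutral value 0.5 are ignored.
    used = [f for f in fw_values if abs(f - 0.5) >= min_dev]
    if not used:
        return 0.5  # assumption: no usable tokens means no evidence
    dof = 2 * len(used)
    h = chi2q(-2.0 * sum(math.log(f) for f in used), dof)
    s = chi2q(-2.0 * sum(math.log(1.0 - f) for f in used), dof)
    return (1.0 + h - s) / 2.0

def classify(score, spam_cutoff, ham_cutoff):
    """Three-way decision; the >= and <= comparisons are assumptions."""
    if score >= spam_cutoff:
        return "spam"
    if score <= ham_cutoff:
        return "nonspam"
    return "unsure"
```

In this sketch, min_dev = 0.25 means only tokens with f(w) at or below
0.25 or at or above 0.75 contribute to the score, and a larger robs
pulls rarely seen tokens more strongly toward robx.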

With this, I reclassified the test corpus and got quite nice results:
3654 correctly identified nonspam, 253 (6.4%) unsure nonspam, 194
correctly identified spam and 7 (3.4%) unsure spam.
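
To compare the two runs side by side, the unsure rates can be recomputed
from the raw counts.  The short Python sketch below does only that
arithmetic; the labels and the helper name are mine.  Both runs cover
the same 3907 nonspams and 201 spams, and the last digit of a computed
percentage may differ from the figures quoted above depending on
rounding.

```python
def unsure_rate(sure, unsure):
    """Percentage of one class's messages that landed in the unsure band."""
    return 100.0 * unsure / (sure + unsure)

# (correctly identified, unsure) counts as reported in this message
runs = {
    "0.8.0 variant": {"nonspam": (3772, 135), "spam": (187, 14)},
    "0.10.1.1": {"nonspam": (3654, 253), "spam": (194, 7)},
}

for name, classes in runs.items():
    for kind, (sure, unsure) in classes.items():
        print(f"{name:14s} {kind:8s} total={sure + unsure:5d} "
              f"unsure={unsure_rate(sure, unsure):.2f}%")
```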

With the MIME processing, I'm getting about 60% more tokens in the
training db's spamlist than were present in the 0.8.0 training db.
This has an unfortunate downside: lookup times are extremely long. The
first 150,000 tokens entered into an empty list took 31 seconds to
process; but the first 500,000 tokens entered into a separate list,
also starting from empty, took 13 minutes and 13 seconds.  Classifying
an individual email with the 500,000-token spamlist and the
150,000-token goodlist can take several hundred milliseconds, and
registering new spam messages on top of what's there now takes around
700ms each (new nonspams are taking about 25 ms each to register).
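
To put the two load timings on a common scale, they can be converted to
average tokens per second.  This is just arithmetic on the figures
above, sketched in Python; the helper name is mine.

```python
def tokens_per_second(tokens, seconds):
    """Average insertion rate over a whole load."""
    return tokens / seconds

small = tokens_per_second(150_000, 31)            # 31 seconds
large = tokens_per_second(500_000, 13 * 60 + 13)  # 13 min 13 s

print(f"150,000-token load: {small:.0f} tokens/s")
print(f"500,000-token load: {large:.0f} tokens/s")
print(f"slowdown factor:    {small / large:.1f}x")
```

The average rate for the larger load is several times lower, so the
cost per token grows with the size of the list rather than staying
constant.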

-- 
| G r e g  L o u i s          | gpg public key:      |
|   http://www.bgl.nu/~glouis |   finger greg at bgl.nu |
| Help free our mailboxes. Include                   |
|        http://wecanstopspam.org in your signature. |



