New script to train bogofilter

Boris 'pi' Piwinger 3.14 at logic.univie.ac.at
Wed Jul 2 09:13:13 CEST 2003


Boris 'pi' Piwinger wrote:

> I wrote a perl script which trains bogofilter on error. It
> produces very small databases. We'll have to see how well
> that works. Any comments are warmly welcome.

I reran my script until I got no errors. The resulting database
was still extremely small: 352 spam and 291 ham messages.
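
For readers who have not seen the approach, here is a minimal sketch
of training on error in Python. It is not the perl script mentioned
above and it does not call bogofilter: ToyFilter is a hypothetical
stand-in with a crude word-count score, and train_on_error, corpus
and max_passes are names made up for this illustration. The point is
only the loop: register a message when it is misclassified, and
repeat the pass until a full pass produces no errors.

```python
from collections import Counter


class ToyFilter:
    """Hypothetical stand-in classifier; NOT bogofilter's algorithm."""

    def __init__(self):
        self.spam = Counter()   # word -> number of registered spam containing it
        self.ham = Counter()    # word -> number of registered ham containing it
        self.n_spam = 0         # registered spam messages
        self.n_ham = 0          # registered ham messages

    def register(self, text, is_spam):
        words = set(text.lower().split())
        if is_spam:
            self.spam.update(words)
            self.n_spam += 1
        else:
            self.ham.update(words)
            self.n_ham += 1

    def is_spam(self, text):
        # Crude score: spam occurrences minus ham occurrences per word.
        words = set(text.lower().split())
        return sum(self.spam[w] - self.ham[w] for w in words) > 0


def train_on_error(filt, corpus, max_passes=20):
    """Register only misclassified messages; stop after an error-free pass."""
    for _ in range(max_passes):
        errors = 0
        for text, is_spam in corpus:
            if filt.is_spam(text) != is_spam:
                filt.register(text, is_spam)
                errors += 1
        if errors == 0:
            break
    return filt.n_spam, filt.n_ham


# Made-up example messages, labelled True for spam and False for ham.
corpus = [
    ("cheap pills buy now", True),
    ("meeting agenda for monday", False),
    ("buy cheap watches now", True),
    ("can you buy lunch on monday", False),
]
filt = ToyFilter()
counts = train_on_error(filt, corpus)
print("registered spam/ham:", counts)
```

Only the messages that were misclassified at some point end up in the
word lists, which is why the database stays small. A message can be
registered again in a later pass if it is still misclassified then.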

Then I started to use it. That was 24 hours ago. So far I have
had just one false negative (with over 100 spam messages
correctly classified) and no false positives.

So my first estimate: This works perfectly, and we need far
fewer messages in the database than we thought before. There
seems to be no practical reason to avoid classifying the
same message more than once.

What does not work is bogotune. I took out the check for at
least 2000 messages each in the database and made two runs.
The first with -C (do not use config files) and the second
with my config:

Verifying training db /home/3.14/.bogofilter/goodlist.db...
Verifying training db /home/3.14/.bogofilter/spamlist.db...
Verifying test files...
Verification completed successfully.
Creating message-count files...
Message-count files bt3038.{sp,ns} created
Recommended cache size is 1 Mbytes.
Calculating false-positive target...
Very few high-scoring nonspams in this data set.
Use these settings (only min_dev may have changed):
robx        = 0.415000 (4.15e-01)
robs        = 0.010000 (1.00e-02)
min_dev     = 0.020000 (2.00e-02)
ham_cutoff  = 0.000000 (0.00e+00)
spam_cutoff = 0.950000 (9.50e-01)
Tuning aborted.

Verifying training db /home/3.14/.bogofilter/goodlist.db...
Verifying training db /home/3.14/.bogofilter/spamlist.db...
Verifying test files...
Verification completed successfully.
Recommended cache size is 1 Mbytes.
Calculating false-positive target...
Very few high-scoring nonspams in this data set.
Use these settings (only min_dev may have changed):
robx        = 0.520000 (5.20e-01)
robs        = 0.100000 (1.00e-01)
min_dev     = 0.020000 (2.00e-02)
ham_cutoff  = 0.000000 (0.00e+00)
spam_cutoff = 0.501000 (5.01e-01)
Tuning aborted.
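
The settings lines in the output above follow one fixed layout
(name, "=", decimal value, then the same value in scientific notation
in parentheses), so they are easy to pull out mechanically when
comparing runs. The following Python sketch does that for the first
run. parse_settings and SETTING_RE are hypothetical names for this
illustration, and the name=value form printed at the end is an
assumption about how one might want to copy the values, not something
bogotune emits.

```python
import re

# Matches lines such as "robx        = 0.415000 (4.15e-01)".
SETTING_RE = re.compile(r"^(\w+)\s*=\s*([0-9.]+)\s*\(")


def parse_settings(output):
    """Return {name: value} for every settings line in the pasted output."""
    settings = {}
    for line in output.splitlines():
        match = SETTING_RE.match(line.strip())
        if match:
            settings[match.group(1)] = float(match.group(2))
    return settings


# Excerpt of the first run, copied from the output above.
sample = """\
Use these settings (only min_dev may have changed):
robx        = 0.415000 (4.15e-01)
robs        = 0.010000 (1.00e-02)
min_dev     = 0.020000 (2.00e-02)
ham_cutoff  = 0.000000 (0.00e+00)
spam_cutoff = 0.950000 (9.50e-01)
Tuning aborted.
"""

settings = parse_settings(sample)
for name, value in settings.items():
    print(f"{name}={value:g}")
```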

So neither run returned anything very useful. But we have two
things to fix in bogotune:

1) The cache size is not listed with the results.

2) min_dev is changed, but by no means optimized.

pi




