bogotune claims too few messages despite >3000 ham and >5000 spam

Jonathan Kamens jik at kamens.us
Thu Jun 23 03:23:01 CEST 2011


I'm trying to tune bogofilter and can't get bogotune to work reliably. 
Below is a transcript.

Why is bogotune having trouble locking onto good settings when I've got 
3121 ham messages and 5818 spam messages? The settings I'm currently 
using, which were generated by an earlier, successful run of bogotune 
about four months ago, are working fine, with a spam detection rate 
above 99%. However, I've noticed a few more false positives than I'd 
like on email from new senders, which is why I'm trying to retune.

Before bogotune had this particular problem, it had another one: it kept 
reporting that it couldn't read my wordlist.db. That problem went away 
after I used bogoutil to remove tokens that haven't been seen in 180 
days from the word list, a maintenance task I do periodically to keep 
the size of the word list reasonable.

I've also recently dumped and reloaded the word list into a new file, 
which brought its size down from 9MB to 3MB, but that didn't help bogotune.
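
For reference, the two maintenance steps described above look roughly 
like the sketch below. The exact commands were not recorded, so treat 
this as an approximation: the -a, -m, -d and -l options are the ones 
documented in the bogoutil man page, while the .bak and .new file names 
and the WORDLIST variable are purely illustrative.

```shell
#!/bin/sh
# Sketch only: paths and file suffixes are assumptions, not the exact
# commands that were run.
WORDLIST="${WORDLIST:-$HOME/.bogofilter/wordlist.db}"
MAX_AGE_DAYS=180

if command -v bogoutil >/dev/null 2>&1 && [ -f "$WORDLIST" ]; then
    # Keep a backup before touching the word list.
    cp "$WORDLIST" "$WORDLIST.bak"

    # Maintenance pass: discard tokens older than MAX_AGE_DAYS days.
    bogoutil -a "$MAX_AGE_DAYS" -m "$WORDLIST"

    # Dump the word list and reload it into a fresh file to compact it.
    bogoutil -d "$WORDLIST" | bogoutil -l "$WORDLIST.new"

    # Compare sizes before deciding whether to swap the new file in.
    ls -l "$WORDLIST" "$WORDLIST.new"
else
    echo "bogoutil or $WORDLIST not found; nothing done"
fi
```

The script leaves the compacted copy next to the original rather than 
replacing it, so the result can be checked before it is put into use.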

Oh, and I should mention that I actively and regularly retrain 
bogofilter with ham and spam, including correcting any 
misclassifications, so my word list is extremely accurate.

Thanks for any advice you can provide.

   jik

$ bogotune -v -T 0 -n /tmp/notspam -s /tmp/bogospam
Reading /home/jik/.bogofilter/wordlist.db
Reading /tmp/notspam
3121 messages
Reading /tmp/bogospam
5818 messages
wordlist's ham to spam ratio is 0.9 to 1.0
Calculating initial x value...
Initial x value is 0.481636
Recommended db cache size is 11 MB
Too few high-scoring non-spams in this data set.
At target 1, cutoff is 0.037362.
False-positive target is 1 (cutoff 0.037362)
Performing final scoring:
Spam...  Non-Spam...
0.000000 0.037362
0.227615 0.033571
0.427900 0.030283
0.437421 0.025977
0.513999 0.024112
0.546457 0.020030
0.573949 0.010595
0.574819 0.003866
0.578010 0.003672
0.613205 0.000913

### The following recommendations are provisional.
### Run bogotune with more messages when possible.


Recommendations:

---cut---
db_cachesize=11
robs=0.0178
min_dev=0.020
robx=0.481636
sp_esf=1.000000
ns_esf=1.000000
spam_cutoff=0.033571    # for 0.05% fp (1); expect 0.02% fn (1).
#spam_cutoff=0.025977   # for 0.10% fp (3); expect 0.02% fn (1).
#spam_cutoff=0.010595   # for 0.20% fp (6); expect 0.02% fn (1).
ham_cutoff=0.011
---cut---

The small number and/or relative uniformity of the test messages imply
that the recommended values (above), though appropriate to the test set,
may not remain valid for long.  Bogotune should be run again with more
messages when that becomes possible.
Tuning completed.


