bogotune claims too few messages despite >3000 ham and >5000 spam

David Relson relson at osagesoftware.com
Thu Jul 7 05:23:51 CEST 2011


Hello Jonathan,

Bogotune is complex and tricky.  To operate, it requires a set of
messages (as you're aware), and those messages must satisfy some
constraints that are implemented in the code but not readily
described in words.  The information you've provided isn't sufficient
to diagnose the actual problem.

Bogotune's "-M" switch converts messages to the "message count"
format, i.e. each message is reduced to its tokens, sorted and with
duplicates removed.  This anonymizes the messages and obscures their
meaning, which effectively removes anything sensitive from them.  If
you'd care to use the "-M" switch and send me (off-list) a copy of the
resulting ham and spam files, I'll run bogotune and see if I can
determine why it's unhappy with your corpora.
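
To give a rough idea of why sorting and de-duplicating tokens obscures
a message, here is a small Python sketch.  It is only an illustration:
the function name to_token_set and the regular expression are made up
for this example, and it does not reproduce bogofilter's real
tokenizer or the actual "message count" file format.

```python
import re


def to_token_set(message: str) -> list[str]:
    """Reduce a message to its sorted, de-duplicated tokens.

    Hypothetical helper for illustration only; bogofilter's lexer
    follows different (and more elaborate) rules.
    """
    # Assumed tokenizer: lowercase runs of letters, digits and a few symbols.
    tokens = re.findall(r"[a-z0-9$'_-]+", message.lower())
    # Word order and repetition are discarded, so the original
    # sentences cannot be read back from the result.
    return sorted(set(tokens))


if __name__ == "__main__":
    print(to_token_set("Meet me at noon. Bring the report; the FINAL report."))
```

The result keeps which words occurred, but not where or how often,
which is the property that makes such files reasonably safe to share.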

Be warned that it's summer travel season so I may not be able to look
at your files right away.

Regards,

David

On Wed, 22 Jun 2011 21:23:01 -0400
Jonathan Kamens wrote:

> I'm trying to tune bogofilter and can't get bogotune to work
> reliably. Below is a transcript.
> 
> Why is bogotune having trouble locking onto good settings, when I've
> got 3121 ham messages and 5818 spam messages? Also, the settings I'm 
> currently using, which were generated by an earlier, successful run
> of bogotune about four months ago, are working just fine, with my
> spam detection rate at above 99%. I've noticed a few more false
> positives than I prefer when I receive email from new entities, which
> is why I'm trying to retune.
> 
> Before bogotune was having this particular problem, it was having 
> another one... it kept reporting that it couldn't read my
> wordlist.db. That problem went away after I used bogoutil to remove
> tokens from the word list that haven't been seen in 180 days, a
> maintenance task I do periodically to keep the size of the word list
> reasonable.
> 
> I've also recently dumped and reloaded the word list into a new file, 
> which brought its size down from 9MB to 3MB, but that didn't help
> bogotune.
> 
> Oh, and I should mention that I actively, regularly retrain
> bogofilter with ham and spam, including fixing any
> mischaracterizations, so my word list is extremely accurate.
> 
> Thanks for any advice you can provide.
> 
>    jik
> 
> $ bogotune -v -T 0 -n /tmp/notspam -s /tmp/bogospam
> Reading /home/jik/.bogofilter/wordlist.db
> Reading /tmp/notspam
> 3121 messages
> Reading /tmp/bogospam
> 5818 messages
> wordlist's ham to spam ratio is 0.9 to 1.0
> Calculating initial x value...
> Initial x value is 0.481636
> Recommended db cache size is 11 MB
> Too few high-scoring non-spams in this data set.
> At target 1, cutoff is 0.037362.
> False-positive target is 1 (cutoff 0.037362)
> Performing final scoring:
> Spam...  Non-Spam...
> 0.000000 0.037362
> 0.227615 0.033571
> 0.427900 0.030283
> 0.437421 0.025977
> 0.513999 0.024112
> 0.546457 0.020030
> 0.573949 0.010595
> 0.574819 0.003866
> 0.578010 0.003672
> 0.613205 0.000913
> 
> ### The following recommendations are provisional.
> ### Run bogotune with more messages when possible.
> 
> 
> Recommendations:
> 
> ---cut---
> db_cachesize=11
> robs=0.0178
> min_dev=0.020
> robx=0.481636
> sp_esf=1.000000
> ns_esf=1.000000
> spam_cutoff=0.033571    # for 0.05% fp (1); expect 0.02% fn (1).
> #spam_cutoff=0.025977   # for 0.10% fp (3); expect 0.02% fn (1).
> #spam_cutoff=0.010595   # for 0.20% fp (6); expect 0.02% fn (1).
> ham_cutoff=0.011
> ---cut---
> 
> The small number and/or relative uniformity of the test messages imply
> that the recommended values (above), though appropriate to the test
> set, may not remain valid for long.  Bogotune should be run again
> with more messages when that becomes possible.
> Tuning completed.
> 


