Bogofilter Tuning Issues...

David Relson relson at osagesoftware.com
Wed Apr 30 16:12:15 CEST 2003


Hi Chris,

Good questions!

First, a bit of info.  0.12.0 and 0.12.1 were released last week, but 
bogofilter-announce was down, so the usual notice didn't go out.  On 
SourceForge, you _can_ get notification whenever bogofilter-current or 
bogofilter-stable is updated.  However, relatively few people make use of 
that capability.  The good news is that 0.12.2 is building as I write this 
message.  I expect it to be available within the hour.

Questions 2 & 3 are easy.  The parameters can be set via command line 
and/or config file.  Look for switches '-m' and config file options "robs=" 
and "min_dev=".  Switch '-o' and options "spam_cutoff=" and "ham_cutoff=" 
are also related.

In 0.11.x "-qv" will display the values.  For 0.12.x use "-Q".

I'd recommend a two-part attack on database size.  First enable 
"ham_cutoff=0.1" in your bogofilter.cf file.  This will activate tristate 
mode in which messages are labeled as "Yes", "No", and "Unsure".  This will 
enable you to easily find the messages that bogofilter couldn't classify 
within its level of certainty (as determined by the spam_cutoff and 
ham_cutoff values).  Then manually train bogofilter with all the Unsures, 
as well as any false positives or false negatives that occur.
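
As a rough sketch, the relevant part of bogofilter.cf might look like the 
fragment below.  Only the option names and the ham_cutoff value come from 
this message; the robs and min_dev values are the ones quoted from the 
tuning HOWTO further down, and spam_cutoff is deliberately left out so 
that its default applies.

```
# bogofilter.cf fragment (illustrative values, adjust for your own mail)

# Robinson s, using the HOWTO's suggested starting point
robs=0.1

# minimum deviation, low end of the HOWTO's suggested .2-.25 range
min_dev=0.2

# enabling ham_cutoff activates tristate mode: Yes / No / Unsure
ham_cutoff=0.1
```

After editing the file, "-Q" (0.12.x) or "-qv" (0.11.x) should show 
whether the values were picked up.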

Also, if you've been using '-u', I hope you've been checking the results 
and correcting any mistakes.  If you choose to follow the suggestion in the 
above paragraph, remove the "-u" from your procmail recipe.

Regarding robx, I can't say a whole lot as I've not done any experiments to 
see how different values for it affect how well bogofilter does its 
job.  I'm still using the default value of 0.415 and am pleased with the 
results.

At 09:31 AM 4/30/03, Chris Ditri wrote:

>Hello again,
>
>I really appreciate the effort put forth in the bogofilter-tuning HOWTO.
>While I found it generally illuminating, there are a few things I am left
>without answers to:
>
>1) It is indicated that an Ideal Robinson X is somewhere around .5.  What
>should we do if it is nowhere near .5?  Start the DB's from scratch (Ack!)?
>
>2) It is suggested that Robinson s should start at .1, but I don't see any
>suggestions on how to change this value -- or even see how to obtain a
>print-out of your current s value.
>
>3) Ditto with the MIN_DEV value and the suggested .2-.25 value.
>
>I have been watching my bogofilter accuracy slowly dwindle over the past few 
>weeks.
>According to the howto, this is likely because my goodlist is twice the size
>of my spamlist (which likely explains my robinson x of .25).  It likewise
>says not to use the -u option... but in the man page the example that it
>gives uses it!
>
>If we shouldn't use the procmail example in the man page (which uses the -u
>option), what would you suggest be the recommended procmailrc?  And,
>shouldn't this replace the example in the man page?
>
>Important question:  Is there a way to restore this balance without having to
>totally retrain the filter?
>
>It also suggests after getting about 5000 messages that we should only teach
>it about errors and unsures.  So basically you are saying that updates need
>to be done essentially by hand, since the only way to hear about "errors" is
>through account users and the only way to be certain of the value of
>"unsures" is to read them.  True?

I have procmail put all messages with "X-Bogosity: Unsure" in 
/var/spool/mail/bogofilter-unsure and have Eudora put them in a 
"bogofilter-unsure" folder.  Just scanning the return address and the 
subject I can generally tell how a message should have been 
classified.  Given that judgement, I switch to my mail server, get the 
unsure message, and save it as "us.MMDD.HHMM.txt" or "ug.MMDD.HHMM.txt" 
(where "us" means "unsure, should be spam" and "ug" means "unsure, should 
be good").  A cron job runs each hour and runs "bogofilter -s" or 
"bogofilter -n" as appropriate.

I also correct the (very rare) false positives and false negatives by 
creating "sh.MMDD.HHMM.txt" (filed as spam, should be ham) and 
"hs.MMDD.HHMM.txt" (filed as ham, should be spam) and running "bogofilter 
-Sn" or "bogofilter -Ns", respectively.

>I am currently using .11.1.6, and plan on upgrading to 11.2 (unless it is
>recommended I go right to 12.1....)

Get 0.12.2 (which will be announced in a few minutes).


>I appreciate your time.  Thanks!

Hope this helps.

Good luck!




