Bogofilter Tuning Issues...

Chris Ditri chrisd at better-investing.org
Wed Apr 30 17:44:25 CEST 2003


Thanks for the Advice, David.

I too have been keeping daily copies of unsure spam.  I am just trying to 
think of the best way to automate this thing, and I have a pretty good idea 
what I am going to do now.

So now I just need to find the best way to shrink my good list... I was 
thinking of using the -m option for bogoutil and telling it to kill older 
entries (say over 45 days old) and if that is not enough then just 
incrementing the value until I get them about even... What do you think of 
that idea?  Is there a better way?  I think I need to do this because my 
false negatives are getting uncomfortably high.

Thanks so much for your help David!


Chris


On Wednesday 30 April 2003 10:12 am, David Relson wrote:
> Hi Chris,
>
> Good questions!
>
> First, a bit of info.  0.12.0 and 0.12.1 were released last week, but
> bogofilter-announce was down, so the usual notice didn't go out.  On
> SourceForge, you _can_ get notification whenever bogofilter-current or
> bogofilter-stable is updated.  However relatively few people make use of
> that capability.  The good news is that 0.12.2 is building as I write this
> message.  I expect it to be available within the hour.
>
> Questions 2 & 3 are easy.  The parameters can be set via command line
> and/or config file.  Look for switches '-m' and config file options "robs="
> and "min_dev=".  Switch '-o' and options "spam_cutoff=" and "ham_cutoff="
> are also related.
>
> In 0.11.x "-qv" will display the values.  For 0.12.x use "-Q".
>
> I'd recommend a two part attack on database size.  First enable
> "ham_cutoff=0.1" in your bogofilter.cf file.  This will activate tristate
> mode in which messages are labeled as "Yes", "No", and "Unsure".  This will
> enable you to easily find the messages that bogofilter couldn't classify
> within its level of certainty (as determined by the spam_cutoff and
> ham_cutoff values).  Then manually train bogofilter with all the Unsures,
> as well as any false positives or false negatives that occur.
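
The tristate labelling described above can be sketched roughly as follows.
This is only an illustration, not bogofilter's own code: the function name
classify() is made up, the spam_cutoff of 0.95 is an arbitrary example
value, and treating a score exactly on a cutoff as Yes/No is an assumption.
Only ham_cutoff=0.1 comes from the message.

```python
def classify(score, spam_cutoff=0.95, ham_cutoff=0.1):
    """Return the X-Bogosity label for a spamicity score in [0, 1].

    spam_cutoff=0.95 is an example value (assumption); ham_cutoff=0.1 is
    the value suggested in the message.  Boundary handling is assumed.
    """
    if score >= spam_cutoff:
        return "Yes"      # confidently spam
    if score <= ham_cutoff:
        return "No"       # confidently ham
    return "Unsure"       # in between: review and train by hand


for s in (0.02, 0.5, 0.99):
    print(s, classify(s))
```

Widening the gap between the two cutoffs sends more mail to "Unsure", which
means more hand-training but fewer confident mistakes.
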
>
> Also, if you've been using '-u', I hope you've been checking the results
> and correcting any mistakes.  If you choose to follow the suggestion in the
> above paragraph, remove the "-u" from your procmail recipe.
>
> Regarding robx, I can't say a whole lot as I've not done any experiments to
> see how different values for it affect how well bogofilter does its
> job.  I'm still using the default value of 0.415 and am pleased with the
> results.
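
For context on what robx, robs and min_dev control: in Robinson's method
as it is commonly described, a token's raw spam probability p(w) is pulled
toward robx with weight robs, giving f(w) = (s*x + n*p(w)) / (s + n), and
tokens whose f(w) lies within min_dev of 0.5 are ignored.  A small sketch,
assuming that formula; the function names are made up, robs=0.1 and
min_dev=0.2 are the HOWTO starting points quoted in this thread, and
whether the min_dev comparison is inclusive is an assumption.

```python
def robinson_f(p_w, n, robx=0.415, robs=0.1):
    """Shrink a token's raw spam probability p_w toward robx.

    n is how often the token was seen in training; robs is the weight
    given to the prior robx.  With n == 0 the result is simply robx.
    """
    return (robs * robx + n * p_w) / (robs + n)


def is_significant(f_w, min_dev=0.2):
    """True if the token is far enough from 0.5 to be used in scoring."""
    return abs(f_w - 0.5) >= min_dev  # inclusive comparison is assumed


# A never-seen token scores robx; a token seen once, only in spam,
# already moves most of the way toward 1.0 because robs is small.
print(robinson_f(1.0, 0))
print(robinson_f(1.0, 1))
print(is_significant(robinson_f(0.5, 10)))
```
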
>
> At 09:31 AM 4/30/03, Chris Ditri wrote:
> >Hello again,
> >
> >I really appreciate the effort put forth in the bogofilter-tuning HOWTO.
> >While I found it generally illuminating, there are a few things I am left
> >without answers to:
> >
> >1) It is indicated that an Ideal Robinson X is somewhere around .5.  What
> >should we do if it is nowhere near .5?  Start the DB's from scratch
> > (Ack!)?
> >
> >2) It is suggested that Robinson s should start at .1, but I don't see any
> >suggestions on how to change this value -- or even see how to obtain a
> >print-out of your current s value.
> >
> >3) Ditto with the MIN_DEV value and the suggested .2-.25 value.
> >
> >I have been watching my bogofilter accuracy slowly dwindle over the past
> >few weeks.
> >According to the howto, this is likely because my goodlist is twice the
> > size of my spamlist (which likely explains my robinson x of .25).  It
> > likewise says not to use the -u option... but in the man page the example
> > that it gives uses it!
> >
> >If we shouldn't use the procmail example in the man page (which uses the
> > -u option), what would you suggest be the recommended procmailrc?  And,
> > shouldn't this replace the example in the man page?
> >
> >Important question:  Is there a way to restore this balance without having
> > to totally retrain the filter?
> >
> >It also suggests after getting about 5000 messages that we should only
> > teach it about errors and unsures.  So basically you are saying that
> > updates need to be done essentially by hand, since the only way to hear
> > about "errors" is through account users and the only way to be certain of
> > the value of "unsures" is to read them.  True?
>
> I have procmail put all messages with "X-Bogosity: Unsure" in a
> /var/spool/mail/bogofilter-unsure and have Eudora put them in a
> "bogofilter-unsure" folder.  Just scanning the return address and the
> subject I can generally tell how a message should have been
> classified.  Given that judgement, I switch to my mail server, get the
> unsure message, and save it as "us.MMDD.HHMM.txt" or "ug.MMDD.HHMM.txt"
> (where "us" means "unsure, should be spam" and "ug" means "unsure, should
> be good").  A cron job runs each hour and runs "bogofilter -s" or
> "bogofilter -n" as appropriate.
>
> I also correct the (very rare) false positives and false negatives by
> creating "sh.MMDD.HHMM.txt" and "hs.MMDD.HHMM.txt" and running "bogofilter
> -Sn" or "bogofilter -Ns".
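
The hourly job described above might look something like the sketch below.
It is a guess at the shape of such a script, not the one actually in use:
the function names, the directory argument and the ".done" rename are all
hypothetical, and the sh -> -Sn / hs -> -Ns pairing is read from the order
in which the message lists them.  The flags themselves (-s, -n, -Sn, -Ns)
are the ones named in the message.

```python
import subprocess
from pathlib import Path

# Filename prefix -> bogofilter training flag, as described in the message.
FLAGS = {
    "us": "-s",   # unsure, should be spam
    "ug": "-n",   # unsure, should be good
    "sh": "-Sn",  # assumed: was registered as spam, re-register as good
    "hs": "-Ns",  # assumed: was registered as good, re-register as spam
}


def flag_for(filename):
    """Return the training flag for e.g. 'us.0430.1012.txt', else None."""
    prefix = Path(filename).name.split(".", 1)[0]
    return FLAGS.get(prefix)


def train(directory):
    """Feed each saved message to bogofilter on stdin, then mark it done."""
    for path in sorted(Path(directory).glob("*.txt")):
        flag = flag_for(path.name)
        if flag is None:
            continue  # not one of our naming patterns; leave it alone
        with path.open("rb") as message:
            subprocess.run(["bogofilter", flag], stdin=message, check=True)
        # Hypothetical: rename so the next run does not train on it again.
        path.rename(path.with_name(path.name + ".done"))
```

Calling train() on the directory holding the saved messages from an hourly
cron entry would reproduce the workflow; renaming processed files matters
because training on the same message twice would count its tokens twice.
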
>
> >I am currently using .11.1.6, and plan on upgrading to 11.2 (unless it is
> >recommended I go right to 12.1....)
>
> Get 0.12.2 (which will be announced in a few minutes).
>
> >I appreciate your time.  Thanks!
>
> Hope this helps.
>
> Good luck!




