Bogofilter Tuning Issues...
Chris Ditri
chrisd at better-investing.org
Wed Apr 30 17:44:25 CEST 2003
Thanks for the advice, David.
I too have been keeping daily copies of unsure spam. I am just trying to
think of the best way to automate this thing, and I have a pretty good idea
what I am going to do now.
So now I just need to find the best way to shrink my good list... I was
thinking of using the -m option for bogoutil and telling it to kill older
entries (say over 45 days old) and if that is not enough then just
incrementing the value until I get them about even... What do you think of
that idea? Is there a better way? I think I need to do this because my
false negatives are getting uncomfortably high.
Thanks so much for your help David!
Chris
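
The age-based pruning idea above can be sketched in Python. This is only an illustration of the logic, not bogoutil's implementation: the token-to-date dictionary, the function names, and the step and minimum-age defaults are all hypothetical. It also assumes that a smaller age limit drops more entries, so the loop lowers the limit from 45 days rather than raising it.

```python
from datetime import date, timedelta


def prune_older_than(last_seen, max_age_days, today):
    """Return a copy of last_seen without tokens older than max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return {tok: seen for tok, seen in last_seen.items() if seen >= cutoff}


def balance_by_age(good, spam_size, today, start_age=45, step=5, min_age=5):
    """Lower the age limit until the good list is no larger than spam_size.

    start_age=45 comes from the message above; step and min_age are
    hypothetical values chosen for illustration.
    """
    age = start_age
    pruned = prune_older_than(good, age, today)
    while len(pruned) > spam_size and age - step >= min_age:
        age -= step
        pruned = prune_older_than(good, age, today)
    return age, pruned
```

Pruning by age alone says nothing about which tokens matter most, so the result would still need to be checked against real mail.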
On Wednesday 30 April 2003 10:12 am, David Relson wrote:
> Hi Chris,
>
> Good questions !
>
> First, a bit of info. 0.12.0 and 0.12.1 were released last week, but
> bogofilter-announce was down, so the usual notice didn't go out. On
> SourceForge, you _can_ get notification whenever bogofilter-current or
> bogofilter-stable is updated. However, relatively few people make use of
> that capability. The good news is that 0.12.2 is building as I write this
> message. I expect it to be available within the hour.
>
> Questions 2 & 3 are easy. The parameters can be set via command line
> and/or config file. Look for switches '-m' and config file options "robs="
> and "min_dev=". Switch '-o' and options "spam_cutoff=" and "ham_cutoff="
> are also related.
>
> In 0.11.x "-qv" will display the values. For 0.12.x use "-Q".
>
> I'd recommend a two part attack on database size. First enable
> "ham_cutoff=0.1" in your bogofilter.cf file. This will activate tristate
> mode in which messages are labeled as "Yes", "No", and "Unsure". This will
> enable you to easily find the messages that bogofilter couldn't classify
> within its level of certainty (as determined by the spam_cutoff and
> ham_cutoff values). Then manually train bogofilter with all the Unsures,
> as well as any false positives or false negatives that occur.
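
As a rough illustration of the tristate labeling described above, the Python sketch below maps a spamicity score to "Yes", "No", or "Unsure". The function name is hypothetical, the spam_cutoff default of 0.95 is a placeholder rather than a value from this thread, and the treatment of scores exactly on a cutoff is an assumption.

```python
def classify(spamicity, ham_cutoff=0.1, spam_cutoff=0.95):
    """Label a score using two cutoffs, as in tristate mode.

    ham_cutoff=0.1 is the value suggested in the message above.
    spam_cutoff=0.95 is a placeholder; use whatever your config sets.
    Inclusive comparisons at the boundaries are an assumption.
    """
    if spamicity >= spam_cutoff:
        return "Yes"
    if spamicity <= ham_cutoff:
        return "No"
    return "Unsure"
```

Everything that falls between the two cutoffs is what ends up in the "Unsure" pile for manual training.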
>
> Also, if you've been using '-u', I hope you've been checking the results
> and correcting any mistakes. If you choose to follow the suggestion in the
> above paragraph, remove the "-u" from your procmail recipe.
>
> Regarding robx, I can't say a whole lot as I've not done any experiments to
> see how different values for it affect how well bogofilter does its
> job. I'm still using the default value of 0.415 and am pleased with the
> results.
>
> At 09:31 AM 4/30/03, Chris Ditri wrote:
> >Hello again,
> >
> >I really appreciate the effort put forth in the bogofilter-tuning HOWTO.
> >While I found it generally illuminating, there are a few things I am left
> >without answers to:
> >
> >1) It is indicated that an Ideal Robinson X is somewhere around .5. What
> >should we do if it is nowhere near .5? Start the DB's from scratch
> > (Ack!)?
> >
> >2) It is suggested that Robinson s should start at .1, but I don't see any
> >suggestions on how to change this value -- or even see how to obtain a
> >print-out of your current s value.
> >
> >3) Ditto with the MIN_DEV value and the suggested .2-.25 value.
> >
> >I have been watching my bogofilter accuracy slowly dwindle over the past
> >few weeks.
> >According to the howto, this is likely because my goodlist is twice the
> > size of my spamlist (which likely explains my robinson x of .25). It
> > likewise says not to use the -u option... but in the man page the example
> > that it gives uses it!
> >
> >If we shouldn't use the procmail example in the man page (which uses the
> > -u option), what would you suggest be the recommended procmailrc? And,
> > shouldn't this replace the example in the man page?
> >
> >Important question: Is there a way to restore this balance without having
> > to totally retrain the filter?
> >
> >It also suggests after getting about 5000 messages that we should only
> > teach it about errors and unsures. So basically you are saying that
> > updates need to be done essentially by hand, since the only way to hear
> > about "errors" is through account users and the only way to be certain of
> > the value of "unsures" is to read them. True?
>
> I have procmail put all messages with "X-Bogosity: Unsure" in a
> /var/spool/mail/bogofilter-unsure and have Eudora put them in a
> "bogofilter-unsure" folder. Just scanning the return address and the
> subject I can generally tell how a message should have been
> classified. Given that judgement, I switch to my mail server, get the
> unsure message, and save it as "us.MMDD.HHMM.txt" or "ug.MMDD.HHMM.txt"
> (where "us" means "unsure, should be spam" and "ug" means "unsure, should
> be good"). A cron job runs each hour and runs "bogofilter -s" or
> "bogofilter -n" as appropriate.
>
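
The routing step described above (messages marked "X-Bogosity: Unsure" go to their own mailbox) is done with procmail in the original setup. The Python sketch below only mirrors that decision for illustration. The function name and the default mailbox argument are hypothetical, and matching on the start of the header value is an assumption about how the header is formatted.

```python
from email import message_from_string

# Mailbox path taken from the message above.
UNSURE_MBOX = "/var/spool/mail/bogofilter-unsure"


def destination(raw_message, default_mbox):
    """Pick a mailbox based on the X-Bogosity header of one message."""
    msg = message_from_string(raw_message)
    verdict = msg.get("X-Bogosity", "")
    # Assumption: the header value begins with Yes, No, or Unsure.
    if verdict.strip().lower().startswith("unsure"):
        return UNSURE_MBOX
    return default_mbox
```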
> I also correct the (very rare) false positives and false negatives by
> creating "sh.MMDD.HHMM.txt" and "hs.MMDD.HHMM.txt" and running "bogofilter
> -Sn" or "bogofilter -Ns".
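
The hourly training job described above could look something like the Python sketch below. It is not the actual cron script. The function names are hypothetical, and pairing "sh" with -Sn and "hs" with -Ns follows the order in which they are listed above, reading "sh" as "classified spam, should be ham" (an assumption). It also assumes bogofilter reads the message from standard input.

```python
import os
import subprocess

# File-name prefix -> bogofilter training flag, as described above.
TRAINING_FLAGS = {
    "us": "-s",   # unsure, should be spam
    "ug": "-n",   # unsure, should be good
    "sh": "-Sn",  # assumed: classified spam, should be ham
    "hs": "-Ns",  # assumed: classified ham, should be spam
}


def training_command(path):
    """Return the bogofilter command for a saved message, or None."""
    prefix = os.path.basename(path).split(".", 1)[0]
    flag = TRAINING_FLAGS.get(prefix)
    if flag is None:
        return None
    return ["bogofilter", flag]


def train(path):
    """Feed one saved message to bogofilter on stdin (assumed input mode)."""
    cmd = training_command(path)
    if cmd is None:
        return False
    with open(path, "rb") as fh:
        subprocess.run(cmd, stdin=fh, check=True)
    return True
```

A cron entry would then call train() on each saved file and move or delete the file once it has been processed, so that no message is counted twice.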
>
> >I am currently using .11.1.6, and plan on upgrading to 11.2 (unless it is
> >recommended I go right to 12.1....)
>
> Get 0.12.2 (which will be announced in a few minutes).
>
> >I appreciate your time. Thanks!
>
> Hope this helps.
>
> Good luck!