Bogofilter Tuning Issues...

Greg Louis glouis at dynamicro.on.ca
Thu May 1 13:54:26 CEST 2003


On 20030430 (Wed) at 10:12:15 -0400, David Relson wrote:
> Hi Chris,
> 
> Good questions !

And they shall be taken into account in the next revision of the HOWTO. 
This kind of "I was left wondering" is really valuable feedback,
thanks!

The suggestions David puts forth are all well worthwhile.  I have a few
further comments:
 
> At 09:31 AM 4/30/03, Chris Ditri wrote:

> >1) It is indicated that an Ideal Robinson X is somewhere around .5.  What
> >should we do if it is nowhere near .5?  Start the DB's from scratch (Ack!)?

Even out the sizes of the spamlist and goodlist, sticking to a Robinson
x of 0.415 in the meantime as David suggests, and then try again.
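
To see why a skewed Robinson x matters, here is a minimal sketch of how
x enters a single token's score, assuming Robinson's usual formula
f(w) = (s*x + n*p(w)) / (s + n).  The function name and the sample
values of s, n and p(w) are illustrative assumptions, not bogofilter's
internals or defaults:

```python
def robinson_f(p_w, n, x, s=1.0):
    """Robinson's adjusted spam probability for one token.

    p_w: raw spam probability of the token from the word lists.
    n:   number of messages the token has been seen in.
    x:   Robinson x, the value assumed for a token never seen before.
    s:   weight given to x (illustrative value, not a bogofilter default).
    """
    return (s * x + n * p_w) / (s + n)


# A token seen once, only in spam (p_w = 1.0), under two values of x.
low = robinson_f(1.0, n=1, x=0.25)
suggested = robinson_f(1.0, n=1, x=0.415)

# A token never seen before (n = 0) scores exactly x.
unseen = robinson_f(0.0, n=0, x=0.415)
```

In this sketch a low x pulls every rarely seen token toward the nonspam
side; the more often a token has been seen, the less x matters.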

> >I have been watching my bogofilter accuracy slowly dwindle over the past
> >few weeks. According to the howto, this is likely because my
> >goodlist is twice the size of my spamlist (which likely explains my
> >robinson x of .25).  It likewise says not to use the -u option...
> >but in the man page the example that it gives uses it!

Goodlist 2x spamlist might well explain the low ROBX but not
necessarily the change in accuracy.  Many factors could contribute,
including a change in spammers' tactics that I think most of us have
seen (my bogofilter's accuracy is improving again after a bit of a
slump).  I do think you should quit expanding the goodlist (except for
errors and unsures) and continue adding all spams to get them to even
out.
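
The balancing advice can be written down as a small decision rule.  A
minimal sketch, assuming you keep track of the two list sizes yourself;
the function `should_train` and its arguments are hypothetical and not
part of bogofilter, and treating both lists symmetrically is an
assumption that goes slightly beyond the advice above:

```python
def should_train(is_spam, verdict, n_spam, n_good):
    """Decide whether to register a message while evening out the lists.

    is_spam: the true class of the message, as judged by a human.
    verdict: what bogofilter said: "spam", "good" or "unsure".
    n_spam, n_good: current message counts of the spamlist and goodlist.
    """
    truth = "spam" if is_spam else "good"
    if verdict != truth:
        # Errors and unsures are always worth training on.
        return True
    # Correctly classified mail: add it only if its own list is not
    # already the larger one (assumed symmetric rule).
    own, other = (n_spam, n_good) if is_spam else (n_good, n_spam)
    return own <= other
```

With a goodlist twice the size of the spamlist, this adds every spam but
only those nonspam messages that bogofilter got wrong or was unsure
about.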

> >If we shouldn't use the procmail example in the man page (which uses the -u
> >option), what would you suggest be the recommended procmailrc?  And,
> >shouldn't this replace the example in the man page?

I wrote the HOWTO but not the man page, and I do things the way it says
in the HOWTO, which may not (yet? :) be official policy... I'll add my
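procmail recipe, which is just:

  # Opted-out users: match and do nothing, so the E ("else") recipe
  # below is skipped for them.
  :0
  * RECIPIENT ?? (list|of|users|who|opted|out)
  { }
  # Everyone else: quarantine the message when bogofilter exits 0 (spam).
  :0HBE:
  * ? bogofilter -L "$SENDER" -d /ha/.bogofilter
  $SPAM_QUARANTINE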

If you don't have any opted-out users, omit the first recipe and change
the :0HBE: of the second to :0HB: (dropping the E, "else", flag).  If
you don't want the sender logged, omit -L "$SENDER" (obviously).  Also
adjust the -d option (the database directory) to suit; it may be
unnecessary in your case.  SPAM_QUARANTINE should be defined earlier in
the .procmailrc file.

> >Important question:  Is there a way to restore this balance without
> >having to totally retrain the filter?

Yes: even out the two lists as described above.

> >It also suggests after getting about 5000 messages that we should
> >only teach it about errors and unsures.  So basically you are saying
> >that updates need to be done essentially by hand, since the only way
> >to hear about "errors" is through account users and the only way to
> >be certain of the value of "unsures" is to read them.  True?

Almost true.  The suggestion kicks in after you get about 5000 _each_
of spam and nonspam.  Manual training is required, yes, but it can be
facilitated.  I've been saving copies of all mail and periodically
classifying them "by hand" for training: I classify with bogofilter,
correct the results manually, and then train; the users haven't been
involved.  (My copy of bogofilter supports this by returning a separate
exit code for unsure, but you can grep the X-Bogosity header if you use
the distributed version.)  This is getting onerous, so we're going to
try operating with one training database but individual spam
quarantines, and giving the users the opportunity to provide feedback
by bouncing messages.  If we don't get into too many newsletter fights,
this will become the new m.o.
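
Picking the verdict out of the X-Bogosity header of a saved message
could look like the following.  A minimal sketch, assuming the verdict
is the first word of the header value; the function name `bogosity` is
hypothetical, and the exact verdict words depend on your bogofilter
version:

```python
from email import message_from_string


def bogosity(raw_message):
    """Return the first word of the X-Bogosity header, lowercased.

    Returns None when the header is missing or empty.
    """
    msg = message_from_string(raw_message)
    value = msg.get("X-Bogosity")
    if value is None:
        return None
    # The verdict is assumed to end at the first comma or whitespace.
    words = value.replace(",", " ").split()
    return words[0].lower() if words else None


def needs_review(raw_message):
    """True for messages a human should look at before training."""
    return bogosity(raw_message) in (None, "unsure")
```

Messages for which `needs_review` is true go to the pile to be read;
the rest only need a quick check for misclassifications before
training.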

David wrote:
> I have procmail put all messages with "X-Bogosity: Unsure" in a 
> /var/spool/mail/bogofilter-unsure and have Eudora put them in a 
> "bogofilter-unsure" folder.  Just scanning the return address and the 
> subject I can generally tell how a message should have been 
> classified.  Given that judgement, I switch to my mail server, get the 
> unsure message, and save it as "us.MMDD.HHMM.txt" or "ug.MMDD.HHMM.txt" 
> (where "us" means "unsure, should be spam" and "ug" means "unsure, should 
> be good").  A cron job runs each hour and runs "bogofilter -s" or 
> "bogofilter -n" as appropriate.

Sounds good to me -- very similar to what I described above.
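
The hourly job David describes might be sketched like this.  It assumes
the saved files sit in one directory and follow his us.*/ug.* naming;
the function names, the ".done" renaming after training and the
injectable `run` argument are illustrative assumptions, not part of his
setup:

```python
import subprocess
from pathlib import Path


def train_flag(filename):
    """Map the file naming to a bogofilter option: us.* -> -s, ug.* -> -n."""
    if filename.startswith("us."):
        return "-s"
    if filename.startswith("ug."):
        return "-n"
    return None


def train_directory(directory, run=subprocess.run):
    """Feed each us.*/ug.* file to bogofilter on stdin, then rename it."""
    trained = []
    for path in sorted(Path(directory).glob("u[sg].*.txt")):
        flag = train_flag(path.name)
        with path.open("rb") as fh:
            # bogofilter reads the message to register from standard input.
            run(["bogofilter", flag], stdin=fh, check=True)
        # Rename so the next hourly run does not train on it twice.
        path.rename(path.with_name(path.name + ".done"))
        trained.append((path.name, flag))
    return trained
```

Called from cron as, say, `train_directory("/path/to/unsure-files")`
(a placeholder path), it registers each file once and leaves any other
files in the directory alone.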

-- 
| G r e g  L o u i s          | gpg public key: finger     |
|   http://www.bgl.nu/~glouis |   glouis at consultronics.com |
| http://wecanstopspam.org in signatures fights junk email |



