Bogofilter Tuning Issues...

David Relson relson at osagesoftware.com
Mon May 5 14:02:13 CEST 2003


At 01:29 PM 4/30/03, elijah wrote:

>On Wed, 30 Apr 2003, David Relson wrote:
>
> > I'd recommend a two part attack on database size.  First enable
> > "ham_cutoff=0.1" in your bogofilter.cf file.  This will activate tristate
> > mode in which messages are labeled as "Yes", "No", and "Unsure".  This will
> > enable you to easily find the messages that bogofilter couldn't classify
> > within its level of certainty (as determined by the spam_cutoff and
> > ham_cutoff values).  Then manually train bogofilter with all the Unsures,
> > as well as any false positives or false negatives that occur.
> >
> > Also, if you've been using '-u', I hope you've been checking the results
> > and correcting any mistakes.  If you choose to follow the suggestion in the
> > above paragraph, remove the "-u" from your procmail recipe.
>
>Ahh, database size: based on past posts, it was my understanding that
>using the only-manually-train-on-unsure method and the
>only-train-on-corrections method made it so that you could not trim
>database size by removing old tokens. This is because useful tokens
>leading to correct categorization don't have their date updated.
>
>Am I correct in this understanding?

More or less, but not totally.  AFAIK, nobody has done any research on 
this.  Just as the definition of spam varies from person to person, which 
updating method to use is a matter of preference, as is the definition of 
"old".

The combination of autoupdating ('-u' flag) and training on unsures will 
give the most up-to-date wordlists.  Using train-on-unsure alone will 
update fewer words and will update them more slowly.  Don't forget, 
though, that each time you train on an unsure message, a variety of words 
get updated, not just one.
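
For anyone who wants to script the train-on-unsure step, here is a rough 
sketch.  It assumes bogofilter's documented exit codes (0 = spam, 1 = ham, 
2 = unsure); check the man page for your version.  The function names 
"verdict" and "classify" are made up for illustration.

```shell
#!/bin/sh

# verdict STATUS
#   Translate a bogofilter exit status into a label.
#   Assumes the documented codes: 0 = spam, 1 = ham, 2 = unsure.
verdict() {
    case "$1" in
        0) echo spam ;;
        1) echo ham ;;
        2) echo unsure ;;
        *) echo error ;;
    esac
}

# classify FILE
#   Run bogofilter on one message file and print the label.
classify() {
    bogofilter < "$1"
    verdict $?
}
```

Once you have looked at a message that came back "unsure" and decided what 
it is, register it by hand with "bogofilter -s < msg" (spam) or 
"bogofilter -n < msg" (ham).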

My best judgement is that either updating method can be used, but the 
definition of "old" tokens would be different.  With autoupdating, the 
age at which a token counts as "old" should be lower than with 
train-on-unsure.  I don't know what the values should be.  You _could_ do 
something like use 60 days as the minimum age with autoupdating and 90 
days with train-on-unsure.
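
To make the age idea concrete, here is a rough sketch that picks the tokens 
older than a cutoff out of a wordlist dump.  It assumes each dumped line 
ends with a last-seen date in YYYYMMDD form, which you should verify 
against the output of "bogoutil -d" for your version.  The function name 
"old_tokens" and the sample data are made up for illustration.

```shell
#!/bin/sh

# old_tokens CUTOFF
#   Read dump lines on stdin and print those whose last field
#   (assumed to be a YYYYMMDD date) is earlier than CUTOFF.
#   Lines whose last field is not an 8-digit number are skipped.
old_tokens() {
    awk -v cutoff="$1" '
        length($NF) == 8 && $NF ~ /^[0-9]+$/ && $NF + 0 < cutoff + 0 { print }
    '
}

# Example with made-up data: prints only "viagra 12 20030105".
printf '%s\n' 'viagra 12 20030105' 'meeting 7 20030420' |
    old_tokens 20030301
```

This only lists the candidates; it doesn't remove anything.  With GNU date, 
a cutoff 60 days back can be computed as "date -d '60 days ago' +%Y%m%d".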

>I am worried about the possibility of a short term solution to keeping
>database size low which results in a gradually growing database which
>cannot be trimmed using a long term solution. Database size is an issue
>for me because I am working with an isp type situation.

How big are your databases?  What are the byte sizes and the word 
counts?  The following script will display the pertinent info:

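#!/bin/sh

# count.sh
#
#       display sizes and counts for $BOGOFILTER_DIR

# byte sizes of the wordlist files
ls -lh "$BOGOFILTER_DIR"/????list.db
# number of messages registered
bogoutil -w "$BOGOFILTER_DIR" .MSG_COUNT
# number of entries in each wordlist (one line per entry in the dump)
for f in "$BOGOFILTER_DIR"/????list.db ; do
     echo "$f  " $(bogoutil -d "$f" | wc -l)
done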




