Bogofilter Tuning Issues...
David Relson
relson at osagesoftware.com
Mon May 5 14:02:13 CEST 2003
At 01:29 PM 4/30/03, elijah wrote:
>On Wed, 30 Apr 2003, David Relson wrote:
>
> > I'd recommend a two part attack on database size. First enable
> > "ham_cutoff=0.1" in your bogofilter.cf file. This will activate tristate
> > mode in which messages are labeled as "Yes", "No", and "Unsure". This will
> > enable you to easily find the messages that bogofilter couldn't classify
> > within its level of certainty (as determined by the spam_cutoff and
> > ham_cutoff values). Then manually train bogofilter with all the Unsures,
> > as well as any false positives or false negatives that occur.
> >
> > Also, if you've been using '-u', I hope you've been checking the results
> > and correcting any mistakes. If you choose to follow the suggestion in the
> > above paragraph, remove the "-u" from your procmail recipe.
>
>Ahh, database size: based on past posts, it was my understanding that
>using the only-manually-train-on-unsure method and the
>only-train-on-corrections method made it so that you could not trim
>database size by removing old tokens. This is because useful tokens
>leading to correct categorization don't have their date updated.
>
>Am I correct in this understanding?

More or less, but not totally. AFAIK, nobody has done any research on
this. Just as the definition of spam varies from person to person, which
updating method to use is a matter of preference, as is the definition of
"old".

The combination of autoupdating ('-u' flag) and training on unsures will
give the most up-to-date wordlists. Using train-on-unsure alone will update
fewer words and will do so more slowly. Don't forget that each time you
train on an unsure message, a variety of words get updated.

My best judgement is that either updating method can be used, but the
definition of "old" tokens would be different. With autoupdating, "old"
should be less old than with train-on-unsure. I don't know what the
values should be. You _could_ do something like use 60 as a minimum age
for autoupdating and use 90 with train-on-unsure.
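
As a rough sketch of that idea (not a tested recipe), the script below picks
a minimum age by training style and only prints the maintenance commands it
would run. The function names and mode names are made up for illustration.
It assumes that ages are counted in days and that your bogoutil has the
'-a age' and '-m' (maintenance) options; check bogoutil's man page for your
version before running anything like this on a real wordlist.

```shell
#!/bin/sh
# Sketch only: choose a minimum token age by training style.
# The 60/90 figures are the guesses from the message, not researched values.
# ASSUMPTION: ages are counted in days.

min_age_for_mode() {
    case "$1" in
        autoupdate)      echo 60 ;;
        train-on-unsure) echo 90 ;;
        *)               echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

# Print (do not run) one maintenance command per wordlist in a directory.
# ASSUMPTION: bogoutil accepts '-a age' together with '-m file'.
show_trim_commands() {
    mode=$1
    dir=$2
    age=$(min_age_for_mode "$mode") || return 1
    for f in "$dir"/????list.db ; do
        echo "bogoutil -a $age -m $f"
    done
}
```

Calling show_trim_commands train-on-unsure "$BOGOFILTER_DIR" would list the
commands for review. Back up the wordlists before running any of them for
real.
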
>I am worried about the possibility of a short term solution to keeping
>database size low which results in a gradually growing database which
>cannot be trimmed using a long term solution. Database size is an issue
>for me because I am working with an isp type situation.

How big are your databases? What are the byte sizes and the word
counts? The following script will display the pertinent info:

#!/bin/sh
# count.sh
#
# display sizes and counts for $BOGOFILTER_DIR

# byte sizes of the wordlist files
ls -lh "$BOGOFILTER_DIR"/????list.db

# message counts (the .MSG_COUNT token)
bogoutil -w "$BOGOFILTER_DIR" .MSG_COUNT

# number of tokens in each wordlist
for f in "$BOGOFILTER_DIR"/????list.db ; do
    echo "$f $(bogoutil -d "$f" | wc -l)"
done