bogofilter-tuning.HOWTO

Sun Feb 1 20:34:59 CET 2004

On Sun, Feb 01, 2004 at 01:56:57PM -0500, Tom Anderson wrote:
>
> Without training, as spam tactics mutate over time, your database may
> become as equally unusable. So, in either case, you still have to
> train consistently.

What I have set up in my office is that a user can correct a
misclassification by moving the spam/ham into a "makespam" or "makegood"
IMAP folder, and it will be given the -Ns or -Sn treatment.  If an email
is left in the spam or Inbox folder for a certain amount of time, it will
be registered with -s or -n, respectively.

All this is done via a cronjob so only one process is updating the
wordlist at one time.

Some of my users have complained that no matter how many times they
correct bogofilter, similar emails are still misclassified.  Since I
save the re-classifications, I looked into some of them.  I tried doing a
-Ns for ones that should be spam, and it didn't change their score no
matter how many times I did it.

What I think is happening is that the spam/ham counts are getting way
out of whack.  So, for example, if you had 1000 good counts for a word
and only 10 spam counts, and suddenly a lot of spams had that word, it
would take 900+ emails (or maybe 450+ if you do a -Ns) before that word
would be counted towards spam.
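
The arithmetic behind those estimates can be sketched as follows.  This
is a deliberately naive model that only asks when the raw spam count
overtakes the raw good count; it is NOT bogofilter's actual scoring, and
the function name is made up for illustration.

```shell
# Hypothetical helper: how many corrections until spam count > good count.
#   $1 = good count, $2 = spam count,
#   $3 = how much each correction closes the gap:
#        1 for -s  (spam +1)
#        2 for -Ns (spam +1 and good -1)
# Naive raw-count comparison only, not bogofilter's real scoring.
corrections_needed() {
    good=$1
    spam=$2
    step=$3
    gap=$(( good - spam ))
    if [ "$gap" -lt 0 ]; then
        echo 0                      # spam count is already ahead
    else
        echo $(( gap / step + 1 ))  # first correction that tips the balance
    fi
}

corrections_needed 1000 10 1   # prints 991
corrections_needed 1000 10 2   # prints 496
```

Under this model the number of corrections grows with the size of the
counts, which is why a long-lived wordlist becomes slow to retrain.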

To correct this problem of changing spam, I "normalized" the wordlists to
a certain max score, say 50.  So a 1000/200 count would be reduced to
50/10 -- meaning that only 40 different emails could change this count's
value.
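
The scaling itself can be sketched in a few lines of shell.  This mirrors
the idea using integer arithmetic; the function name is hypothetical, and
the argument order (spam, then ham) is an assumption matching the column
order of a bogoutil dump.

```shell
# Hypothetical helper: scale a spam/ham count pair so that the larger
# of the two becomes at most $3, keeping their ratio roughly intact.
normalize() {
    spam=$1
    ham=$2
    norm=$3
    max=$spam
    if [ "$ham" -gt "$max" ]; then
        max=$ham
    fi
    if [ "$max" -gt "$norm" ]; then
        # Integer division truncates, so small counts can drop to 0.
        spam=$(( spam * norm / max ))
        ham=$(( ham * norm / max ))
    fi
    echo "$spam $ham"
}

normalize 200 1000 50   # prints "10 50"
normalize 10 1000 50    # prints "0 50"
normalize 3 7 50        # prints "3 7" (already below the limit)
```

The second call shows one side effect worth knowing about: a count that
is small relative to the larger one is truncated to zero.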

I did this for the users who were complaining about the
re-classification script not working for them, and so far it looks like
it has improved their situation.

The script is listed below.  To use it, pipe in the results from
bogoutil:
  bogoutil -d ./wordlist.db | bfnormalize.pl 50 | bogoutil -l ./new.db
and the wordlist database new.db will be "normalized" to a max score of
50.

Please let me know if there are any faults in doing this.  I can't see it
screwing up the current classifications, as the spam/ham ratio per word
*should* stay the same, and it should help on re-classifications.

Chris

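#!/usr/bin/perl
# bfnormalize.pl - scale the counts in a bogoutil dump down to a maximum.
# Reads "word spam ham date" lines on STDIN and writes them to STDOUT.

use strict;
use warnings;

# Maximum allowed count; defaults to 20 if no argument is given.
my $norm = shift || 20;

while (<STDIN>) {
	my ($word, $spam, $ham, $date) = split;
	if ($spam > $norm || $ham > $norm) {
		# Log the original counts of every word that gets scaled.
		print STDERR "$word $spam $ham $date\n";
		# Divide both counts by the same factor so the larger becomes $norm.
		my $max = ($spam > $ham) ? $spam : $ham;
		my $div = $max / $norm;
		$spam = int($spam/$div);
		$ham = int($ham/$div);
	}
	print "$word $spam $ham $date\n";
}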