bogofilter-tuning.HOWTO

David Relson relson at osagesoftware.com
Sun Feb 1 20:59:20 CET 2004


On Sun, 1 Feb 2004 11:34:59 -0800
Chris Wilkes wrote:

...[snip]...

> #!/usr/bin/perl
> 
> my $norm = shift || 20;
> 
> while (<STDIN>) {
> 	my ($word, $spam, $ham, $date) = split;
> 	if ($spam > $norm || $ham > $norm) {
> 		print STDERR "$word $spam $ham $date\n";
> 		my ($max, $div);
> 		$max = ($spam > $ham) ? $spam : $ham;
> 		$div = $max / $norm;
> 		$spam = int($spam/$div);
> 		$ham = int($ham/$div);
> 	}
> 	print "$word $spam $ham $date\n";
> }		

Hi Chris,

If I'm understanding the above script and its usage, the goal is to
keep the spam and ham counts for a token at or below 50.  Given 35 and
100 as the values for a token, the script will normalize them to 17 and
50.  Yes?
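
To check my reading of the arithmetic, here is a rough Python
transcription of the script's scaling rule.  The function name
scale_counts is mine, and the limit of 50 is assumed from the example
above (the script itself defaults to 20 when no argument is given).

```python
def scale_counts(spam, ham, norm=50):
    """Mirror the quoted Perl: if either count exceeds norm, divide both
    counts by max/norm and truncate, so the larger one lands on norm."""
    if spam > norm or ham > norm:
        div = max(spam, ham) / norm   # floating-point division, as in Perl
        spam = int(spam / div)        # int() truncates, like Perl's int()
        ham = int(ham / div)
    return spam, ham


print(scale_counts(35, 100))  # (17, 50)
print(scale_counts(10, 30))   # (10, 30), both already within the limit
```

Note that the divisor is computed per token, so every token that gets
scaled ends up with its larger count pinned at the limit.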

There _is_ a problem.  Bogofilter uses meta-token .MSG_COUNT to record
the number of ham and spam messages used in training.  When a token is
scored, its ham and spam counts are normalized using the .MSG_COUNT
values.

Now consider what happens when you start with an empty database and
train with, say, 1000 each of ham and spam.  Meta-token .MSG_COUNT would
have value (1000,1000).  A heavily used token might have values
(800,200), indicating it's used in 80% of the spam and 20% of the ham. A
lightly used token might be (100,25), i.e. 10% of the spam and 2.5% of
the ham.  After normalization to a limit of 50, the meta-token would be
at (50,50) and the other two tokens would both be at (50,12).
Bogofilter would interpret this as meaning that 100% of the spam
contained each token, which is not correct.  Also, having normalized
the two tokens to the same values would cause bogofilter to consider
them equally valid ham/spam indicators.
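
The effect can be reproduced with a few lines of Python.  This is only
a sketch: scale_counts is my transcription of the script's rule with an
assumed limit of 50, the token names are made up, and fractions()
computes just the simple per-class share of messages, not bogofilter's
actual scoring formula.

```python
def scale_counts(spam, ham, norm=50):
    """The script's rule: cap the larger count at norm, scale the other."""
    if spam > norm or ham > norm:
        div = max(spam, ham) / norm
        spam = int(spam / div)
        ham = int(ham / div)
    return spam, ham


def fractions(token, msg_count):
    """Share of spam messages and of ham messages containing the token."""
    return token[0] / msg_count[0], token[1] / msg_count[1]


msg_count = (1000, 1000)                              # .MSG_COUNT
tokens = {"heavy": (800, 200), "light": (100, 25)}    # hypothetical tokens

scaled_msg_count = scale_counts(*msg_count)           # (50, 50)
for name, counts in tokens.items():
    before = fractions(counts, msg_count)
    after = fractions(scale_counts(*counts), scaled_msg_count)
    print(name, "before:", before, "after:", after)
```

Before scaling the two tokens have clearly different shares (0.8 versus
0.1 of the spam); afterwards both come out at 1.0 of the spam and 0.24
of the ham.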

Now, I'm not saying that you've broken bogofilter and have caused it to
fail.  The Bayesian technique seems to tolerate many parametric changes
and still give good results.  I _am_ saying that the script's
normalization process is mathematically bogus.

What _might_ work better would be to use the .MSG_COUNT values and scale
everything relative to that.  A value of 1000 or 100 _might_ work.  To
be honest, I've not experimented with this and am speculating.
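
Purely to illustrate the idea, and not something tried against a real
wordlist, one possible reading is sketched below in Python.  The
function name rescale_db, the dictionary layout, and the choice of one
scale factor per class are all my assumptions; only the .MSG_COUNT
token and the target of 100 come from the discussion above.

```python
def rescale_db(db, target=100):
    """Scale all counts so that .MSG_COUNT becomes at most (target, target).

    db maps token -> (spam_count, ham_count) and must include ".MSG_COUNT".
    One factor per class is shared by every token, so each token keeps
    its share of the messages in that class (up to integer truncation).
    """
    msg_spam, msg_ham = db[".MSG_COUNT"]

    def scale(count, total):
        # Integer arithmetic avoids floating-point surprises; counts are
        # left alone when the class total is already within the target.
        return count * target // total if total > target else count

    return {tok: (scale(s, msg_spam), scale(h, msg_ham))
            for tok, (s, h) in db.items()}


db = {".MSG_COUNT": (1000, 1000), "heavy": (800, 200), "light": (100, 25)}
print(rescale_db(db))
# {'.MSG_COUNT': (100, 100), 'heavy': (80, 20), 'light': (10, 2)}
```

Here the heavy token still shows up in 80% of the spam and the light
one in 10%.  Truncation does cost some precision on small counts (25
becomes 2 rather than 2.5), and rare tokens could drop to zero.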

Of course, the fact that you're getting satisfactory results is the
important thing.

Cheers!

David



