Will spam/ham counts in wordlist affect spamicity?

Thu Sep 18 00:40:59 CEST 2003

The subject is a little confusing.  What I want to do is take all my
users' wordlist.db files and combine them into one wordlist that my
qmail server can consult right before it completes the SMTP session, so
that it can decide to reject or accept a mail without having to deliver
it to a user who is then going to filter it as spam anyway.  This is
done using the qmailqueue patch.

I wrote a quick perl script that combines the wordlists listed on the
command line.  It accepts a few switches, mainly:
  -c #        total count (good + spam) a token has to exceed
  -m #        ratio (good/spam or spam/good) a token has to exceed
  -d ######## cutoff date (YYYYMMDD); a token is kept only if its
              newest date is later than this
So I can now coalesce many wordlist.dbs, throwing out every word whose
total count is not above one (-c 1) or whose ratio between spam and
good counts is not above 2 (-m 2).

I put the ratio bit in there to allow only "sure" good or spam words.
For example, if the word "panda" is in there with these counts from
bogoutil -d:
  panda 2 6
the ratio for this word is 6/2 = 3.  Likewise, if "cartridges" is in
there with this bogoutil -d output:
  cartridges 10 1
the ratio is 10/1 = 10.

With this -m multiplier factor I'm hoping to throw out unsure words that
have close spam and good counts.  If I ran with "-m 4", the panda word
(ratio 3) wouldn't be included in my combined list.  "-m 1" would throw
out only words with the same spam and good counts.
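
To check the arithmetic above, here is a minimal sketch of the ratio
test on its own.  The helper names token_ratio and keep_token are made
up for this illustration and do not appear in the script at the end of
this mail; the sketch assumes, as that script does, that a token with a
zero count on either side is always kept.

```perl
use strict;
use warnings;

# Ratio of the larger count to the smaller one.
# Returns undef when either count is zero, since no ratio can be formed.
sub token_ratio {
    my ($spam, $good) = @_;
    return undef unless $spam && $good;
    return $spam > $good ? $spam / $good : $good / $spam;
}

# True if a token survives the given -m threshold.
# Assumption: tokens with a zero count on one side are always kept.
sub keep_token {
    my ($spam, $good, $min_ratio) = @_;
    my $ratio = token_ratio($spam, $good);
    return 1 unless defined $ratio;
    return $ratio > $min_ratio ? 1 : 0;
}

# The two sample tokens from the text, tested against "-m 4".
for my $sample ([ 'panda', 2, 6 ], [ 'cartridges', 10, 1 ]) {
    my ($token, $first, $second) = @$sample;
    printf "%-10s ratio %d, -m 4: %s\n",
        $token,
        token_ratio($first, $second),
        keep_token($first, $second, 4) ? 'kept' : 'dropped';
}
```

With these inputs, panda (ratio 3) is dropped and cartridges (ratio 10)
is kept.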

I'm doing this so that I can get a global wordlist.db that's pretty
loose in classifying a spam message.  I figure if I only include those
words that are sure bets then bogofilter should let more things slide.

Is this the case?  Is this even worthwhile?  The main reason for doing
this is to give my users the benefits of bogofilter before moving them
over to the new server where they can have their own wordlists.

Chris

#!/usr/bin/perl

# takes a list of wordlists on the command line
# and adds up all their scores

use Getopt::Std;
use strict;

getopts('m:d:c:');
our ($opt_m, $opt_d, $opt_c, %info);

# accept "2", "2.5", "2." or ".5", but not "" or "1..2"
checkformat($opt_m, 'Multiplier', '^(?:\d+\.?\d*|\.\d+)$');
checkformat($opt_c, 'Total Count', '^\d+$');
checkformat($opt_d, 'Date',       '^\d{8}$');

die "Must pass me some files to look at\n"
  unless (@ARGV);

foreach my $file (@ARGV) {
	unless (-f $file) {
		print STDERR "Skipping '$file' as not a file\n";
		next;
	}
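	# list-form pipe open, so odd characters in $file never reach a shell
	open(my $dump, '-|', 'bogoutil', '-d', $file)
	  or die "Cannot run bogoutil on '$file': $!\n";
	my @words = <$dump>;
	close($dump);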
	foreach (@words) {
		chomp;
		my ($token, $spamcount, $hamcount, $date) = split;
		$info{$token}{spamcount} += $spamcount;
		$info{$token}{hamcount}  += $hamcount;
		push @{$info{$token}{date}}, $date;
	}
}

foreach my $token (keys %info) {
	my ($spamcount, $goodcount, $date);
	$spamcount = $info{$token}{spamcount};
	$goodcount = $info{$token}{hamcount};
	# The ratio test applies only when -m was given and both counts
	# are non-zero; a token seen only as spam or only as good is
	# always kept.
	if ($opt_m && $spamcount && $goodcount) {
		my $ratio = ($goodcount > $spamcount) ? $goodcount / $spamcount : $spamcount / $goodcount;
		next if ($ratio <= $opt_m);
	}
	next if ($opt_c && (($spamcount + $goodcount) <= $opt_c));
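	# newest date seen for this token across all wordlists
	$date = (sort { $b <=> $a } @{$info{$token}{date}})[0];
	# apply the date cutoff only when -d was given
	next if ($opt_d && $date <= $opt_d);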
	print "$token $spamcount $goodcount $date\n";
}

sub checkformat {
	my ($val, $type, $regex) = @_;
	return unless ($val);
	die "Value '$val' for the $type didn't match regex $regex\n"
	  unless ($val =~ /$regex/);
}
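
For anyone who wants to try the merge step without a real wordlist.db,
here is a rough sketch that runs the same summing logic over in-memory
lines in "token spamcount goodcount date" form.  The function name
merge_dumps and the sample data are invented for illustration, and the
sketch keeps only the newest date per token instead of a list of dates.
It uses the //= operator, so it assumes Perl 5.10 or later.

```perl
use strict;
use warnings;

# Merge lines in "token spamcount goodcount date" form into one hash,
# summing the counts and keeping the newest date seen for each token.
sub merge_dumps {
    my (@lines) = @_;
    my %merged;
    for my $line (@lines) {
        my ($token, $spam, $good, $date) = split ' ', $line;
        next unless defined $good;    # skip blank or malformed lines
        my $entry = $merged{$token} //= { spam => 0, good => 0, date => 0 };
        $entry->{spam} += $spam;
        $entry->{good} += $good;
        $entry->{date} = $date
            if defined $date && $date > $entry->{date};
    }
    return \%merged;
}

# Made-up sample data standing in for two users' dumps.
my @user_a = ('panda 2 6 20030901', 'cartridges 10 1 20030815');
my @user_b = ('panda 1 3 20030910');

my $merged = merge_dumps(@user_a, @user_b);
for my $token (sort keys %$merged) {
    my $e = $merged->{$token};
    print "$token $e->{spam} $e->{good} $e->{date}\n";
}
```

Here the two panda lines collapse into a single entry with counts 3 and
9 and the later of the two dates.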