New script to train bogofilter
Boris 'pi' Piwinger
3.14 at logic.univie.ac.at
Mon Jun 30 17:25:01 CEST 2003
Hi!
I wrote a Perl script which trains bogofilter on error, i.e. it
registers a mail only if bogofilter misclassifies it. It
produces very small databases. We'll have to see how well
that works. Any comments are warmly welcome.
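To make the train-on-error idea concrete, here is a minimal sketch that uses a toy word-count classifier in place of bogofilter. Everything in it (the %count table and the classify, train and train_on_error helpers) is hypothetical and only illustrates the control flow; it does not reproduce bogofilter's scoring.

```perl
use strict;
use warnings;

# Toy stand-in for bogofilter: per-class word counts (illustrative only).
my %count = (ham => {}, spam => {});

# Lower-case a text and split it into non-empty words.
sub words {
    my ($text) = @_;
    return grep { length } split /\W+/, lc $text;
}

# Classify by which class has seen the words more often.
# Ties, including the empty database, yield "unsure".
sub classify {
    my ($text) = @_;
    my ($ham, $spam) = (0, 0);
    for my $w (words($text)) {
        $ham  += $count{ham}{$w}  || 0;
        $spam += $count{spam}{$w} || 0;
    }
    return $spam > $ham ? 'spam' : $ham > $spam ? 'ham' : 'unsure';
}

# Register every word of the text for the given class.
sub train {
    my ($class, $text) = @_;
    $count{$class}{$_}++ for words($text);
}

# Train on error: register the mail only if the verdict is not
# already the expected class. Returns 1 if training happened.
sub train_on_error {
    my ($class, $text) = @_;
    return 0 if classify($text) eq $class;
    train($class, $text);
    return 1;
}
```

Mails that are already classified correctly never touch the database, which is why the database stays small.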
Here is a sample run:
[3.14 at pi ~/local/bogolists]$ build-bogofilter-database.pl \
.bogofilter 'ham*' 'spam*'
Starting with this database:
(empty)
Done:
Read 22457 ham mails and 14806 spam mails.
Added 196 ham mails and 202 spam mails to the database.
spam good
.MSG_COUNT 202 196
False negatives: 232
False positives: 52
[3.14 at pi ~/local/bogolists]$ build-bogofilter-database.pl \
.bogofilter 'ham*' 'spam*'
Starting with this database:
spam good
.MSG_COUNT 202 196
Done:
Read 22457 ham mails and 14806 spam mails.
Added 69 ham mails and 91 spam mails to the database.
spam good
.MSG_COUNT 293 265
False negatives: 68
False positives: 20
[3.14 at pi ~/local/bogolists]$ build-bogofilter-database.pl \
.bogofilter 'ham*' 'spam*'
Starting with this database:
spam good
.MSG_COUNT 293 265
Done:
Read 22457 ham mails and 14806 spam mails.
Added 17 ham mails and 48 spam mails to the database.
spam good
.MSG_COUNT 341 282
False negatives: 1
False positives: 6
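
To put the three runs in proportion, the error counts can be divided by the corpus sizes reported above. This is only a sketch: the helper name error_rate and the data layout are made up for illustration and are not part of the script.

```perl
use strict;
use warnings;

# Corpus sizes as reported by the sample runs.
my ($ham_total, $spam_total) = (22457, 14806);

# [false negatives, false positives] after each of the three runs.
my @runs = ([232, 52], [68, 20], [1, 6]);

# Hypothetical helper: $errors as a percentage of $total.
sub error_rate {
    my ($errors, $total) = @_;
    return 100 * $errors / $total;
}

my $n = 0;
for my $run (@runs) {
    my ($fn, $fp) = @$run;
    printf "Run %d: %.2f%% of spam missed, %.2f%% of ham misclassified\n",
        ++$n, error_rate($fn, $spam_total), error_rate($fp, $ham_total);
}

# Share of the whole corpus registered after the third run.
printf "Database holds %.2f%% of all mails\n",
    error_rate(341 + 282, $ham_total + $spam_total);
```

With these figures the share of missed spam falls from about 1.57% to about 0.01% over the three runs, while only 623 of the 37263 mails are in the database.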
pi
-------------- next part --------------
#!/usr/bin/perl
# Script to train bogofilter from mboxes
# by Boris 'pi' Piwinger <3.14 at piology.org>
# Print usage and exit unless the number of parameters is correct
unless (scalar(@ARGV)==3 || (scalar(@ARGV)==4 && $ARGV[0] eq "-v")) {
print <<END;
Usage:
build-bogofilter-database [-v] <database-directory> <ham-mboxes> <spam-mboxes>
Trains bogofilter where needed only.
Run formail -es on your files before you use them.
It may be a good idea to run the same command several times.
Example:
build-bogofilter-database.pl .bogofilter 'ham*' 'spam*'
END
exit;
}
# Check input and open
my$verbose; if (scalar(@ARGV)==4) {$verbose=1; shift (@ARGV);}
my($dir,$ham,$spam)=@ARGV;
my$temp; srand; do {$temp="/tmp/".rand();} until (!-e $temp);
die ("$dir is not a directory or not accessible.\n") unless (-d $dir && -r $dir && -w $dir && -x $dir);
print "\nStarting with this database:\n";
my$dbexists=(-s "$dir/goodlist.db");
if ($dbexists) {print `bogoutil -w $dir .MSG_COUNT`;} else {print " (empty)\n";}
open (HAM, "cat $ham|") || die("Cannot open ham: $!\n");
open (SPAM, "cat $spam|") || die("Cannot open spam: $!\n");
# Loop through all the mails
# bogofilter return codes: 0 for spam; 1 for non-spam
my($lasthamline,$lastspamline,$hamcount,$spamcount,$hamadd,$spamadd)=("","",0,0,0,0);
do {
    # Read one mail from ham box and test, train as needed
    unless (eof(HAM)) {
        my$mail=$lasthamline;
        $lasthamline="";
        while (defined(my$line=<HAM>)) {
            if ($line=~/^From /) {$lasthamline=$line; last;}
            $mail.=$line;
        }
        if ($mail) {
            $hamcount++;
            open (TEMP, ">$temp") || die "Cannot write to temp file: $!";
            print TEMP $mail;
            close (TEMP);
            unless ($dbexists && system("bogofilter -d $dir <$temp")/256==1) {
                system("bogofilter -d $dir -n <$temp");
                $hamadd++;
                $dbexists=1;
                print "Training ham.\n" if ($verbose);
            } else {print "Not training ham.\n" if ($verbose);}
            unlink ($temp);
        }
    }
    # Read one mail from spam box and test, train as needed
    unless (eof(SPAM)) {
        my$mail=$lastspamline;
        $lastspamline="";
        while (!eof(SPAM) && defined(my$line=<SPAM>)) {
            if ($line=~/^From /) {$lastspamline=$line; last;}
            $mail.=$line;
        }
        if ($mail) {
            $spamcount++;
            open (TEMP, ">$temp") || die "Cannot write to temp file: $!";
            print TEMP $mail;
            close (TEMP);
            unless (system("bogofilter -d $dir <$temp")/256==0) {
                system("bogofilter -d $dir -s <$temp");
                $spamadd++;
                print "Training spam.\n" if ($verbose);
            } else {print "Not training spam.\n" if ($verbose);}
            unlink ($temp);
        }
    }
} until (eof(HAM) && eof(SPAM));
close (HAM);
close (SPAM);
print "\nDone:\n";
print "Read $hamcount ham mails and $spamcount spam mails.\n";
print "Added $hamadd ham mails and $spamadd spam mails to the database.\n";
print `bogoutil -w $dir .MSG_COUNT`;
print "\nFalse negatives: ", `cat $spam | bogofilter -d $dir -vM | grep -c Ham`;
print "False positives: ", `cat $ham | bogofilter -d $dir -vM | grep -c Spam`, "\n";
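
The script above cuts each mbox stream into mails at lines starting with "From ". That step can be isolated in a small function, sketched here on the assumption that body lines beginning with "From " have already been escaped, as is conventional in mbox files. The name split_mbox is hypothetical and does not appear in the script.

```perl
use strict;
use warnings;

# Hypothetical helper: split mbox text into a list of mails.
# Every line starting with "From " opens a new mail.
sub split_mbox {
    my ($text) = @_;
    my @mails;
    my $current = '';
    # split /^/m keeps the trailing newline on each line.
    for my $line (split /^/m, $text) {
        if ($line =~ /^From / && length $current) {
            push @mails, $current;
            $current = '';
        }
        $current .= $line;
    }
    push @mails, $current if length $current;
    return @mails;
}

# Usage: two mails in one mbox string.
my @mails = split_mbox(
    "From a\@example.org\nSubject: one\n\nfirst body\n" .
    "From b\@example.org\nSubject: two\n\nsecond body\n"
);
print scalar(@mails), " mails found\n";
```

Unlike the loop in the script, this version holds the whole mbox in memory, so it suits small test files rather than large archives.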