New new script to train bogofilter
Boris 'pi' Piwinger
3.14 at logic.univie.ac.at
Fri Jul 4 12:37:53 CEST 2003
Hi!
Here is the new version of my script. New options give more
detailed output and can force the script to keep running until
no errors are left.
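To make the idea behind -f concrete, here is a minimal sketch of train-on-error repeated until no error is left. It does not call bogofilter: the word-weight classifier, the four-message corpus and the $max_passes guard are all assumptions made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy model (NOT bogofilter): word => times trained as spam minus times trained as ham
my %weight;

sub classify {
    my ($text) = @_;
    my $score = 0;
    $score += ($weight{$_} || 0) for split ' ', $text;
    return $score > 0 ? 'spam' : 'ham';
}

sub train {
    my ($text, $label) = @_;
    $weight{$_} += ($label eq 'spam' ? 1 : -1) for split ' ', $text;
}

# Hypothetical corpus of [label, text] pairs
my @corpus = (
    [ spam => 'cheap pills at noon' ],
    [ ham  => 'meeting at noon' ],
    [ spam => 'cheap watches' ],
    [ ham  => 'lunch at noon' ],
);

my $force      = 1;    # plays the role of -f
my $max_passes = 10;   # safety guard, not part of the original script
my ($passes, $trained, $errors) = (0, 0, 0);

do {
    $passes++;
    for my $msg (@corpus) {
        my ($label, $text) = @$msg;
        next if classify($text) eq $label;   # train only on misclassified messages
        train($text, $label);
        $trained++;
    }
    # Re-check the whole corpus, as the script does after each run
    $errors = grep { classify($_->[1]) ne $_->[0] } @corpus;
} until ($errors == 0 || !$force || $passes >= $max_passes);

print "Passes: $passes, trained: $trained, errors left: $errors\n";
```

Only misclassified messages are added, which is why the resulting database stays small compared to the number of mails read.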
I started from scratch. Six runs were needed to go down to
zero errors. The program read 22532 ham mails and 15217 spam
mails. The database ended with this:
             spam  good
.MSG_COUNT    329   283
% of read     2.2   1.3
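The "% of read" row is simply the .MSG_COUNT values divided by the number of mails read. A small check of that arithmetic, using only the numbers quoted in this mail (the hash names are made up for the example):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Numbers taken from this mail: mails read and final .MSG_COUNT per class
my %read    = (spam => 15217, good => 22532);
my %trained = (spam => 329,   good => 283);

my %pct;
for my $class (qw(spam good)) {
    # Share of the read mails that ended up in the database, one decimal
    $pct{$class} = sprintf("%.1f", 100 * $trained{$class} / $read{$class});
    printf "%-4s %5d of %5d read = %s%%\n",
        $class, $trained{$class}, $read{$class}, $pct{$class};
}
```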
It turns out that not a single message was used twice in
training (by accident, but it might ease some worries ;-).
I am curious to hear why this is not great and why I should
see problems soon.
pi
-------------- next part --------------
#!/usr/bin/perl
# Script to train bogofilter from mboxes
# by Boris 'pi' Piwinger <3.14 at piology.org>
# Wrong number of parameters or invalid option: print usage and exit
unless (scalar(@ARGV)==3 || (scalar(@ARGV)==4 && $ARGV[0] =~ /^-(?=[vf]{1,3}$)(?:f?v{0,2}|v+f)$/)) {
print <<END;
Usage:
build-bogofilter-database [-[f][v[v]]] <database-directory> <ham-mboxes> <spam-mboxes>
Trains bogofilter where needed only.
Run formail -es on your files before you use them.
It may be a good idea to run the same command several times.
-f does that for you.
Example:
build-bogofilter-database.pl .bogofilter 'ham*' 'spam*'
Options:
Note: Give at most one option argument; combine the letters (e.g. -fv).
-v This switch produces info on messages used for training.
-vv In addition messages not used for training are listed.
-f Runs the program until no error remains.
END
exit;
}
# Check input and open
# Assign unconditionally: "my \$x = 1 if ..." has undefined behaviour in Perl
my$force=(scalar(@ARGV)==4 && $ARGV[0]=~s/f//) ? 1 : 0;
my$verbose=(scalar(@ARGV)==4 && $ARGV[0]=~s/^-v/-/) ? 1 : 0;
my$vverbose=(scalar(@ARGV)==4 && $ARGV[0] eq "-v") ? 1 : 0;
shift (@ARGV) if (scalar(@ARGV)==4);
my($dir,$ham,$spam)=@ARGV;
my$temp; srand; do {$temp="/tmp/".rand();} until (!-e $temp);
die ("$dir is not a directory or not accessible.\n") unless (-d $dir && -r $dir && -w $dir && -x $dir);
print "\nStarting with this database:\n";
my$dbexists=(-s "$dir/goodlist.db");
if ($dbexists) {print `bogoutil -w $dir .MSG_COUNT`;} else {print " (empty)\n";}
my($fp,$fn);
do { # Start force loop
open (HAM, "cat $ham|") || die("Cannot open ham: $!\n");
open (SPAM, "cat $spam|") || die("Cannot open spam: $!\n");
# Loop through all the mails
# bogofilter return codes: 0 for spam; 1 for non-spam
my($lasthamline,$lastspamline,$hamcount,$spamcount,$hamadd,$spamadd)=("","",0,0,0,0);
do {
# Read one mail from ham box and test, train as needed
unless (eof(HAM)) {
my$mail=$lasthamline;
$lasthamline="";
while (defined(my$line=<HAM>)) {
if ($line=~/^From /) {$lasthamline=$line; last;}
$mail.=$line;
}
if ($mail) {
$hamcount++;
open (TEMP, ">$temp") || die "Cannot write to temp file: $!";
print TEMP $mail;
close (TEMP);
unless ($dbexists && system("bogofilter -d $dir <$temp")/256==1) {
system("bogofilter -d $dir -n <$temp");
$hamadd++;
$dbexists=1;
print "Training ham message $hamcount.\n" if ($verbose);
} else {print "Not training ham message $hamcount.\n" if ($vverbose);}
unlink ($temp);
}
}
# Read one mail from spam box and test, train as needed
unless (eof(SPAM)) {
my$mail=$lastspamline;
$lastspamline="";
while (!eof(SPAM) && defined(my$line=<SPAM>)) {
if ($line=~/^From /) {$lastspamline=$line; last;}
$mail.=$line;
}
if ($mail) {
$spamcount++;
open (TEMP, ">$temp") || die "Cannot write to temp file: $!";
print TEMP $mail;
close (TEMP);
unless (system("bogofilter -d $dir <$temp")/256==0) {
system("bogofilter -d $dir -s <$temp");
$spamadd++;
print "Training spam message $spamcount.\n" if ($verbose);
} else {print "Not training spam message $spamcount.\n" if ($vverbose);}
unlink ($temp);
}
}
} until (eof(HAM) && eof(SPAM));
close (HAM);
close (SPAM);
print "\nDone:\n";
print "Read $hamcount ham mails and $spamcount spam mails.\n";
print "Added $hamadd ham mails and $spamadd spam mails to the database.\n";
print `bogoutil -w $dir .MSG_COUNT`;
$fn=`cat $spam | bogofilter -d $dir -vM | grep -c Ham`;
print "\nFalse negatives: $fn";
$fp=`cat $ham | bogofilter -d $dir -vM | grep -c Spam`;
print "False positives: $fp\n";
} until ($fn+$fp==0 || !$force);
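
The script above separates messages by looking for lines that begin with "From ". For anyone who wants to try that step in isolation, here is a minimal sketch; the split_mbox helper and the two sample messages are made up for illustration, and it assumes the mbox has been normalised (e.g. with formail -es) so that body lines starting with "From " are escaped.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split an mbox stream into messages; a line starting with "From " opens a new one.
# split_mbox is a hypothetical helper, not part of the script above.
sub split_mbox {
    my ($fh) = @_;
    my @messages;
    my $current = '';
    while (defined(my $line = <$fh>)) {
        if ($line =~ /^From / && $current ne '') {
            push @messages, $current;
            $current = '';
        }
        $current .= $line;
    }
    push @messages, $current if $current ne '';
    return @messages;
}

# Hypothetical sample data, read through an in-memory file handle
my $sample = "From a\@example.org Fri Jul  4 12:00:00 2003\nSubject: one\n\nbody one\n"
           . "From b\@example.org Fri Jul  4 12:01:00 2003\nSubject: two\n\nbody two\n";
open(my $fh, '<', \$sample) or die "Cannot open in-memory handle: $!";
my @messages = split_mbox($fh);
close($fh);

printf "Found %d messages\n", scalar @messages;
```

Unlike the script, this collects all messages in memory first, which is fine for a test but not for mailboxes with tens of thousands of mails.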