New new script to train bogofilter

Boris 'pi' Piwinger 3.14 at logic.univie.ac.at
Fri Jul 4 12:37:53 CEST 2003


Hi!

Here is the new version of my script. New options give more
detailed output and can force the script to rerun until no
errors are left.

I started from scratch. Six runs were needed to go down to
zero errors. The program read 22532 ham mails and 15217 spam
mails. The database ended with this:
                       spam   good
.MSG_COUNT              329    283
% of read               2.2    1.3
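
The "% of read" row can be reproduced from the counts above. The following is only a minimal sketch of that arithmetic; the helper name percent_used is made up for this illustration and is not part of the attached script.

```perl
use strict;
use warnings;

# Share of the messages read that ended up in the database, in percent.
# (percent_used is a hypothetical helper, not part of the training script.)
sub percent_used {
    my ($trained, $read) = @_;
    return sprintf("%.1f", 100 * $trained / $read);
}

# Counts taken from the run described above.
printf "spam: %s%%\n", percent_used(329, 15217);   # prints "spam: 2.2%"
printf "good: %s%%\n", percent_used(283, 22532);   # prints "good: 1.3%"
```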

It turns out that not a single message was used twice in
training (by accident, but it might calm some worries ;-).
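
One way such a claim could be checked, assuming the -v output of all runs was saved, is to count how often each "Training ham/spam message N." line occurs. This is a sketch only: the function name trained_twice and the sample lines are invented for the illustration, and it relies on the message numbers being stable across runs because the same mboxes are read in the same order.

```perl
use strict;
use warnings;

# Return the messages ("ham N" / "spam N") that were trained more than once,
# given the lines printed by the script when run with -v.
# (trained_twice is a hypothetical helper, not part of the training script.)
sub trained_twice {
    my @lines = @_;
    my %seen;
    for my $line (@lines) {
        # "Not training ..." lines do not match because of the ^ anchor.
        $seen{"$1 $2"}++ if $line =~ /^Training (ham|spam) message (\d+)\.$/;
    }
    my @dups = grep { $seen{$_} > 1 } keys %seen;
    return sort @dups;
}

# Made-up sample output from two runs, for illustration only.
my @sample = (
    "Training ham message 1.",
    "Training spam message 1.",
    "Not training ham message 2.",
    "Training ham message 7.",
    "Training ham message 1.",
);

my @dups = trained_twice(@sample);
print @dups ? "Trained more than once: @dups\n"
            : "No message was trained twice.\n";
```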

I am curious to hear why this is not great and why I should
see problems soon.

pi
-------------- next part --------------
#!/usr/bin/perl
# Script to train bogofilter from mboxes
# by Boris 'pi' Piwinger <3.14 at piology.org>

# Print usage and exit unless called with a valid set of parameters
unless (scalar(@ARGV)==3 || (scalar(@ARGV)==4 && $ARGV[0] =~ /^-(?=[vf]{1,3}$)(?:f?v{0,2}|v+f)$/)) {
  print <<END;

Usage:
  build-bogofilter-database [-[f][v[v]]] <database-directory> <ham-mboxes> <spam-mboxes>
  Trains bogofilter only where needed.
  Run formail -es on your files before you use them.
  It may be a good idea to run the same command several times.
   -f does that for you.

Example:
  build-bogofilter-database.pl .bogofilter 'ham*' 'spam*'

Options:
  Note: Only one option argument is accepted, so combine switches (e.g. -fv).
  -v   Lists the messages used for training.
  -vv  In addition, lists the messages not used for training.
  -f   Reruns the program until no errors remain.

END
  exit;
}

# Check input and open
# Avoid "my $x = 1 if ..."; its behavior is undefined according to perlsyn.
my$force=(scalar(@ARGV)==4 && $ARGV[0]=~s/f//) ? 1 : 0;
my$verbose=(scalar(@ARGV)==4 && $ARGV[0]=~s/^-v/-/) ? 1 : 0;
my$vverbose=(scalar(@ARGV)==4 && $ARGV[0] eq "-v") ? 1 : 0;
shift (@ARGV) if (scalar(@ARGV)==4);
my($dir,$ham,$spam)=@ARGV;
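# Let File::Temp pick a unique temp file instead of guessing a name in /tmp
use File::Temp qw(tempfile);
my(undef,$temp)=tempfile("bogotrain-XXXXXX", TMPDIR=>1, UNLINK=>1);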

die ("$dir is not a directory or not accessible.\n") unless (-d $dir && -r $dir && -w $dir && -x $dir);
print "\nStarting with this database:\n";
my$dbexists=(-s "$dir/goodlist.db");
if ($dbexists) {print `bogoutil -w $dir .MSG_COUNT`;} else {print "  (empty)\n";}

my($fp,$fn);
do { # Start force loop
open (HAM, "cat $ham|")   || die("Cannot open ham: $!\n");
open (SPAM, "cat $spam|") || die("Cannot open spam: $!\n");

# Loop through all the mails
# bogofilter return codes: 0 for spam; 1 for non-spam
my($lasthamline,$lastspamline,$hamcount,$spamcount,$hamadd,$spamadd)=("","",0,0,0,0);
do {

# Read one mail from ham box and test, train as needed
unless (eof(HAM)) {
  my$mail=$lasthamline;
  $lasthamline="";
  while (defined(my$line=<HAM>)) {
    if ($line=~/^From /) {$lasthamline=$line; last;}
    $mail.=$line;
  }
  if ($mail) {
    $hamcount++;
    open (TEMP, ">$temp") || die "Cannot write to temp file: $!";
    print TEMP $mail;
    close (TEMP);
    unless ($dbexists && system("bogofilter -d $dir <$temp")/256==1) {
      system("bogofilter -d $dir -n <$temp");
      $hamadd++;
      $dbexists=1;
      print "Training ham message $hamcount.\n" if ($verbose);
    } else {print "Not training ham message $hamcount.\n" if ($vverbose);}
    unlink ($temp);
  }
}
  
# Read one mail from spam box and test, train as needed
unless (eof(SPAM)) {
  my$mail=$lastspamline;
  $lastspamline="";
  while (defined(my$line=<SPAM>)) {
    if ($line=~/^From /) {$lastspamline=$line; last;}
    $mail.=$line;
  }
  if ($mail) {
    $spamcount++;
    open (TEMP, ">$temp") || die "Cannot write to temp file: $!";
    print TEMP $mail;
    close (TEMP);
    unless (system("bogofilter -d $dir <$temp")/256==0) {
      system("bogofilter -d $dir -s <$temp");
      $spamadd++;
      print "Training spam message $spamcount.\n" if ($verbose);
    } else {print "Not training spam message $spamcount.\n" if ($vverbose);}
    unlink ($temp);
  }
}

} until (eof(HAM) && eof(SPAM));
close (HAM);
close (SPAM);

print "\nDone:\n";
print "Read $hamcount ham mails and $spamcount spam mails.\n";
print "Added $hamadd ham mails and $spamadd spam mails to the database.\n";
print `bogoutil -w $dir .MSG_COUNT`;
$fn=`cat $spam | bogofilter -d $dir -vM | grep -c Ham`;
print "\nFalse negatives: $fn";
$fp=`cat $ham | bogofilter -d $dir -vM | grep -c Spam`;
print "False positives: $fp\n";
} until ($fn+$fp==0 || !$force);


