How to deal with extremely high spam levels

Bob Vincent bogofilter at bobvincent.org
Tue Jul 13 13:43:44 CEST 2004


On Sun, Jul 11, 2004 at 03:07:14PM -0400, Tom Anderson wrote:
> On Sun, 2004-07-11 at 14:52, Bob Vincent wrote:
> > On Sun, Jul 11, 2004 at 10:30:10AM -0400, Tom Anderson wrote:
> > > Even when automated, that sounds like a complex process.  Do you think
> > > regular users will be able to do this?
> > 
> > At 4000 spams per day, I'm hardly a "regular user."
> 
> 
> Nonetheless, in a few months to a year, "regular users" may be receiving
> nearly this level of spams.  My rate has been increasing exponentially. 
> It'd be nice to have a system in place that can handle any level of
> spams for any user.  That's why I'd like to see some sort of
> normalization within bogofilter so that an imbalance of this sort does
> not cause catastrophic problems.
> 


Okay, enclosed is my autotrain script.

It's useful for me; it may or may not be useful for anybody else.

It relies on several aspects of my setup:

  * My email provider allows IMAP access.

  * I am using offlineimap to sync with the remote IMAP server.

  * I am using curl to upload the resultant database.


At the top of the script are several tunable parameters, including:

  $maxruns -- The maximum number of times a single message will be trained.
              If any message has to be trained more times than this, it is
              assumed to be mis-classified, and will be un-trained and deleted.

  $maxsave -- The maximum number of spams or hams to save.  If we have more
              spam or ham messages, the oldest ones not used for training
              are deleted.
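
As a rough illustration of how these two parameters behave, here is a minimal sketch. The helpers training_action and prune_candidates, and the filename => [age in days, times trained] layout, are hypothetical names made up for this example; they are not part of the attached script, which keeps its counts in plain hashes and calls bogofilter directly.

```perl
use strict;
use warnings;

# Hypothetical helper: decide what to do with a misclassified message,
# given how many times it has already been trained.
sub training_action {
    my ($times_trained, $maxruns) = @_;
    return $times_trained + 1 > $maxruns ? 'untrain-and-delete' : 'train';
}

# Hypothetical helper: $msgs maps filename => [age_in_days, times_trained].
# Returns the files to delete so that at most $maxsave remain, oldest
# first, never touching a message that was used for training.
sub prune_candidates {
    my ($msgs, $maxsave) = @_;
    my $excess = keys(%$msgs) - $maxsave;
    return () if $excess <= 0;
    my @unused = grep { $msgs->{$_}[1] == 0 } keys %$msgs;
    my @oldest_first =
        sort { $msgs->{$b}[0] <=> $msgs->{$a}[0] || $a cmp $b } @unused;
    $excess = @oldest_first if $excess > @oldest_first;
    return @oldest_first[0 .. $excess - 1];
}

print training_action(3, 4), "\n";    # train
print training_action(4, 4), "\n";    # untrain-and-delete

my %msgs = (a => [10, 0], b => [30, 2], c => [20, 0], d => [5, 0]);
print join(',', prune_candidates(\%msgs, 2)), "\n";    # c,a
```

With $maxruns = 4, a message is trained on its first four misclassifications and given up on at the fifth. Message b above is the oldest, but it was used for training, so it is kept.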


           
-------------- next part --------------
#!/usr/bin/perl
$home = "/home/bobvin";
$bin = "$home/bin";
$maxruns = 4;
$maxsave = 1000;
$bogofilter = "$bin/bogofilter -c $home/.bogofilter/config";
$bogodir = "$home/.bogofilter";
$spamfolder="$home/Mail/spam";
$goodfolder="$home/Mail/ham";
$dbfile = "$bogodir/wordlist.qdbm";
$remote = "ftp://lolmoh.pair.com/~/.bogofilter/wordlist.qdbm";
unlink $dbfile;   # start from an empty wordlist; it is rebuilt from the saved mail below
print('Updating local Maildirs from IMAP server...');
system('offlineimap');
print("Done.\n");
%tspam = ();
%tgood = ();
loadmaildir(\%tgood,$goodfolder);
loadmaildir(\%tspam,$spamfolder);
# Train one message to seed database...
my ($seed) = keys %tgood;   # any one good message will do
system("$bogofilter -n -I '$seed'") if defined $seed;
my @filenames;
my ($pos,$neg,$uns,$run) = (0,0,0,0);
do
  {
    ($pos,$neg,$uns) = (0,0,0);
    loadfilenames(\@filenames,\%tgood,\%tspam,\$run);
    foreach $filename ( @filenames )
      {
        system("$bogofilter -I '$filename'");
	if ($? == -1)
          {
            die("$bogofilter failed to execute: $!\n");
	  }
	elsif ($? & 127)
	  {
	    printf("$bogofilter died with signal %d\n",$? & 127);
	  }
	else
	  {
	    $status = ($? >> 8);
	    if (($status !=1) && defined($tgood{$filename})) # misclassified good
	      {
		train(\%tgood,($status==0) ? \$pos : \$uns,$filename,'n');
	      }
	    elsif (($status != 0) && defined($tspam{$filename})) # misclassified spam
	      {
		train(\%tspam,($status==1) ? \$neg : \$uns,$filename,'s');
	      }
	    else
	      {
		print(".");
	      }

	  }
      }
    printf("\n%d false positives; %d false negatives; %d unsures.\n",$pos,$neg,$uns);
  } until (($pos+$neg+$uns) == 0);
prune(\%tgood,$maxsave,'good');
prune(\%tspam,$maxsave,'spam');

print('Synchronizing Deletions with IMAP server...');
system('offlineimap');
print("Done.\n");

print('Uploading fully-trained database...');
system("curl -n -T $dbfile $remote");
print("Done.\n");

sub loadmaildir($$)
  {
    my $hashref = shift;
    my $folder = shift;
    my $count;
    my $basename;
    foreach $subfolder ('cur','new')
      {
	print("Loading $folder/$subfolder...");
	opendir(TEMP,"$folder/$subfolder")
	  or die("Cannot open $folder/$subfolder: $!\n");
	$count = 0;
	while (defined($basename = readdir(TEMP)))
	  {
	    # Skip ".", ".." and other dot entries wherever they appear;
	    # directory order is not guaranteed.
	    next if $basename =~ /^\./;
	    $$hashref{"$folder/$subfolder/$basename"} = 0;
	    $count++;
	  }
	closedir(TEMP);
	printf("%d messages.\n",$count);
      }
  }


sub loadfilenames($$$$)
  {
    my $filenames = shift;
    my $goodref = shift;
    my $spamref = shift;
    my $run = shift;
    $$run++;
    @$filenames = ();
    push @$filenames,keys %$goodref;
    push @$filenames,keys %$spamref;
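    use List::Util qw(shuffle);
    # A random sort comparator is inconsistent and gives a biased order;
    # shuffle() gives a proper random permutation.
    @$filenames = shuffle(@$filenames);
    printf("Run %d: Training with %d good and %d spam messages.\n",$$run, scalar keys %$goodref,scalar keys %$spamref);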
  }

    
sub train($$$$)
  {
    my $hashref = shift;
    my $countref = shift;
    my $filename = shift;
    my $tchar = shift;
    my $runs;
    $$countref++;
    if ( ($runs = ++$$hashref{$filename}) > $maxruns)
      {
	$tchar = uc($tchar);
	while (--$runs)
	  {
	    system("$bogofilter -$tchar -I '$filename'");
	  }
	unlink($filename);
	delete($$hashref{$filename});   # remove the key so later runs and prune() skip the deleted file
      }
    else
      {
	system("$bogofilter -$tchar -I '$filename'");
      }
    print($tchar);
  }

sub prune($$$)
  {
    my $hashref = shift;
    my $maxsave = shift;
    my $msgkind = shift;
    my @filenames = keys %$hashref;
    my $numfiles = scalar(@filenames);
    if ($numfiles > $maxsave)
      {
	my $todelete = $numfiles - $maxsave;
	printf("Deleting oldest %d %s messages not used for training.\n",
	       $todelete, $msgkind);
	# -M is the file age in days, so this sorts oldest first.
	foreach $filename (sort { (-M $b) <=> (-M $a) } @filenames)
	  {
	    if ($$hashref{$filename}==0)
	      {
		unlink($filename);
		last unless --$todelete;
	      }
	  }
      }
  }


