How to deal with extremely high spam levels
Bob Vincent
bogofilter at bobvincent.org
Tue Jul 13 13:43:44 CEST 2004
On Sun, Jul 11, 2004 at 03:07:14PM -0400, Tom Anderson wrote:
> On Sun, 2004-07-11 at 14:52, Bob Vincent wrote:
> > On Sun, Jul 11, 2004 at 10:30:10AM -0400, Tom Anderson wrote:
> > > Even when automated, that sounds like a complex process. Do you think
> > > regular users will be able to do this?
> >
> > At 4000 spams per day, I'm hardly a "regular user."
>
>
> Nonetheless, in a few months to a year, "regular users" may be receiving
> nearly this level of spams. My rate has been increasing exponentially.
> It'd be nice to have a system in place that can handle any level of
> spams for any user. That's why I'd like to see some sort of
> normalization within bogofilter so that an imbalance of this sort does
> not cause catastrophic problems.
>
Okay, enclosed is my autotrain script.
It's useful for me; it may or may not be useful for anybody else.
It relies on several aspects of my setup:
* My email provider allows IMAP access.
* I am using offlineimap to sync with the remote IMAP server.
* I am using curl to upload the resultant database.
At the top of the script are several tunable parameters, including:

  $maxruns -- The maximum number of times a single message will be trained.
              If any message has to be trained more times than this, it is
              assumed to be mis-classified, and will be un-trained and deleted.

  $maxsave -- The maximum number of spams or hams to save. If we have more
              spam or ham messages, the oldest ones not used for training
              are deleted.
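
As a rough illustration of how these two limits behave, here is a minimal sketch. The helper names next_action and excess_messages are hypothetical and do not appear in the attached script; they only restate the rules described above, assuming a per-message counter of how many times the message has already been trained.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: decide what to do with a message that was just
# misclassified, given how often it has already been trained.
sub next_action {
    my ($times_trained, $maxruns) = @_;
    # One more training run would exceed the cap, so the message is
    # assumed to be mislabelled: undo its training and drop it.
    return 'untrain_and_delete' if $times_trained + 1 > $maxruns;
    return 'train_again';
}

# Hypothetical helper: upper bound on how many of the oldest, never-trained
# messages get deleted so that a folder shrinks toward $maxsave.
sub excess_messages {
    my ($count, $maxsave) = @_;
    return $count > $maxsave ? $count - $maxsave : 0;
}

print next_action(3, 4), "\n";            # prints "train_again"
print next_action(4, 4), "\n";            # prints "untrain_and_delete"
print excess_messages(1250, 1000), "\n";  # prints "250"
```

With $maxruns = 4, a message can be trained four times; the fifth misclassification triggers the un-train and delete. Because only messages that were never used for training are eligible for pruning, a folder can still end up holding more than $maxsave messages.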
-------------- next part --------------
#!/usr/bin/perl
$home = "/home/bobvin";
$bin = "$home/bin";
$maxruns = 4;
$maxsave = 1000;
$bogofilter = "$bin/bogofilter -c $home/.bogofilter/config";
$bogodir = "$home/.bogofilter";
$spamfolder="$home/Mail/spam";
$goodfolder="$home/Mail/ham";
$dbfile = "$bogodir/wordlist.qdbm";
$remote = "ftp://lolmoh.pair.com/~/.bogofilter/wordlist.qdbm";
unlink $dbfile; # discard the old wordlist so the database is rebuilt from scratch
print('Updating local Maildirs from IMAP server...');
system('offlineimap');
print("Done.\n");
%tspam = ();
%tgood = ();
loadmaildir(\%tgood,$goodfolder);
loadmaildir(\%tspam,$spamfolder);
# Train one message to seed database...
my ($seed) = keys %tgood; # any one good message will do
system("$bogofilter -n -I '$seed'");
my @filenames;
my ($pos,$neg,$uns,$run) = (0,0,0,0);
do
{
($pos,$neg,$uns) = (0,0,0);
loadfilenames(\@filenames,\%tgood,\%tspam,\$run);
foreach $filename ( @filenames )
{
system("$bogofilter -I '$filename'");
if ($? == -1)
{
die("$bogofilter failed to execute: $!\n");
}
elsif ($? & 127)
{
printf("$bogofilter died with signal %d\n",$? & 127);
}
else
{
$status = ($? >> 8); # bogofilter exit status: 0 = spam, 1 = ham, 2 = unsure, 3 = error
if (($status !=1) && defined($tgood{$filename})) # misclassified good
{
train(\%tgood,($status==0) ? \$pos : \$uns,$filename,'n');
}
elsif (($status != 0) && defined($tspam{$filename})) # misclassified spam
{
train(\%tspam,($status==1) ? \$neg : \$uns,$filename,'s');
}
else
{
print(".");
}
}
}
printf("\n%d false positives; %d false negatives; %d unsures.\n",$pos,$neg,$uns);
} until (($pos+$neg+$uns) == 0);
prune(\%tgood,$maxsave,'good');
prune(\%tspam,$maxsave,'spam');
print('Synchronizing Deletions with IMAP server...');
system('offlineimap');
print("Done.\n");
print('Uploading fully-trained database...');
system("curl -n -T $dbfile $remote");
print("Done.\n");
sub loadmaildir($$)
{
my $hashref = shift;
my $folder = shift;
my $dirpos;
my $basename;
foreach $subfolder ('cur','new')
{
print("Loading $folder/$subfolder...");
opendir(TEMP,"$folder/$subfolder") or die("Cannot open $folder/$subfolder: $!\n");
$dirpos = 0;
# Skip dot entries wherever they appear; defined() keeps a file named "0" from ending the loop.
while (defined($basename = readdir(TEMP)))
{
next if $basename =~ /^\./;
$$hashref{"$folder/$subfolder/$basename"} = 0;
$dirpos++;
}
closedir(TEMP);
printf("%d messages.\n",$dirpos);
}
}
sub loadfilenames($$$$)
{
my $filenames = shift;
my $goodref = shift;
my $spamref = shift;
my $run = shift;
$$run++;
@$filenames = ();
push @$filenames,keys %$goodref;
push @$filenames,keys %$spamref;
# sort { rand() <=> rand() } is not a consistent comparator; use a real shuffle instead.
use List::Util qw(shuffle);
@$filenames = shuffle(@$filenames);
printf("Run %d: Training with %d good and %d spam messages.\n",$$run, scalar keys %$goodref,scalar keys %$spamref);
}
sub train($$$$)
{
my $hashref = shift;
my $countref = shift;
my $filename = shift;
my $tchar = shift;
my $runs;
$$countref++;
if ( ($runs = ++$$hashref{$filename}) > $maxruns)
{
$tchar = uc($tchar);
while (--$runs)
{
system("$bogofilter -$tchar -I '$filename'");
}
unlink($filename);
delete $$hashref{$filename}; # remove the key so later runs and prune() skip the deleted file
}
else
{
system("$bogofilter -$tchar -I '$filename'");
}
print($tchar);
}
sub prune($$$)
{
my $hashref = shift;
my $maxsave = shift;
my $msgkind = shift;
my @filenames = keys %$hashref;
my $numfiles = scalar(@filenames);
if ($numfiles > $maxsave)
{
my $todelete = $numfiles - $maxsave;
printf("Deleting oldest %d %s messages not used for training.\n",
$todelete, $msgkind);
foreach $filename (sort { (-M $b) <=> (-M $a) } @filenames) # oldest first
{
if ($$hashref{$filename}==0)
{
unlink($filename);
last unless --$todelete;
}
}
}
}