Performance test framework

Mark M. Hoffman mhoffman at lightlink.com
Thu Sep 19 08:16:47 CEST 2002


Hi everyone:

A couple of days ago, I volunteered (to Adrian) to create a test
framework for bogofilter.  You can find the first cut here:

http://home.pacbell.net/mh0ffman/bogotest-0.1.tar.gz

Usage should be clear from the embedded POD documentation.  Feedback
(positive or negative) is appreciated.  Please be gentle; I'm not a
Perl guru. ;)

bogotest-perf.pl

I use the suffix "perf" to distinguish it from any kind of
algorithm correctness or regression test, which I would like
to work on later.  The pace of bogofilter development right
now would make any algorithm correctness tests almost useless.

Briefly: this script requires a directory full of messages
(msg.*).  For each message M, it trains bogofilter on all the
other messages (!M) and then reports bogofilter's prediction
for M.  Once every message in the test directory has been
held out this way, it prints some final statistics and quits.
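
The hold-one-out loop described above can be sketched roughly as
follows.  This is an illustration in Python, not the code in
bogotest-perf.pl (which is Perl).  The names leave_one_out, train,
and is_spam are hypothetical; the two callables stand in for
whatever actually trains and queries bogofilter.

```python
from typing import Callable, Dict, Iterable, List, Set


def leave_one_out(
    messages: List[str],
    spam: Set[str],
    train: Callable[[Iterable[str], Iterable[str]], object],
    is_spam: Callable[[object, str], bool],
) -> Dict[str, int]:
    """Hold out each message in turn, train on the rest, score the prediction.

    `train(spam_msgs, good_msgs)` and `is_spam(model, msg)` are hypothetical
    stand-ins for the real training and evaluation steps.
    """
    stats = {"correct": 0, "missed_spam": 0, "false_positive": 0}
    for held_out in messages:
        # Everything except the held-out message M is training data (!M).
        rest = [m for m in messages if m != held_out]
        model = train(
            [m for m in rest if m in spam],      # spam training set
            [m for m in rest if m not in spam],  # good training set
        )
        predicted = is_spam(model, held_out)
        actual = held_out in spam
        if predicted == actual:
            stats["correct"] += 1
        elif actual:
            stats["missed_spam"] += 1      # spam that was called good
        else:
            stats["false_positive"] += 1   # good mail that was called spam
    return stats
```

Because the model is rebuilt from scratch for every held-out
message, no prediction is ever made by a filter that has already
seen the message it is judging.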

For each of these runs, the local spamlist and hamlist
(*not* the ones in $HOME/.bogofilter) are deleted and recreated.
The only other file required to run bogotest-perf.pl
is "spamlist.txt", which lists the spam message filenames,
one per line.
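
To make the role of "spamlist.txt" concrete, here is a small
Python sketch (again, not the Perl in the tarball) that splits
the msg.* files into spam and good sets.  It assumes that
spamlist.txt sits in the test directory itself and that each
line is a bare filename; neither detail is spelled out above.
The function name split_by_spamlist is hypothetical.

```python
from pathlib import Path
from typing import List, Tuple


def split_by_spamlist(test_dir: str) -> Tuple[List[str], List[str]]:
    """Return (spam, good) file names among msg.* in test_dir.

    Assumption: spamlist.txt is in test_dir and holds one bare
    file name per line; blank lines are ignored.
    """
    root = Path(test_dir)
    listed = {
        line.strip()
        for line in (root / "spamlist.txt").read_text().splitlines()
        if line.strip()
    }
    names = sorted(p.name for p in root.glob("msg.*") if p.is_file())
    spam = [n for n in names if n in listed]
    good = [n for n in names if n not in listed]  # anything unlisted is good
    return spam, good
```

Any message that is not listed is treated as good mail, so a
missing entry silently turns a spam into a "good" training
message.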

This test is compute-intensive; there's no way around it.  The
test as a whole reads each message file a number of times
equal to the number of files, making it O(n^2).  However, I've
kept the number of bogofilter invocations down to 3x the number
of messages: train good, train spam, and evaluate.
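
As a rough worked example of that cost, the Python snippet below
just does the arithmetic.  The function name run_cost is
hypothetical, and 341 is the message count from the statistics
at the end of this message.

```python
def run_cost(n_messages: int) -> dict:
    """Count file reads and bogofilter invocations for one full test."""
    reads_per_file = n_messages                # each file is read once per run
    total_reads = reads_per_file * n_messages  # n files, n reads each: O(n^2)
    invocations = 3 * n_messages               # train good, train spam, evaluate
    return {"file_reads": total_reads, "invocations": invocations}


print(run_cost(341))
# prints {'file_reads': 116281, 'invocations': 1023}
```

Doubling the size of the test directory therefore roughly
quadruples the file reads but only doubles the invocations.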

bogotest-classify.pl

This script semi-automates the job of classifying spam messages
by hand, if that makes any sense.  Its output is the file
"spamlist.txt" as required by bogotest-perf.pl.  This script is
very raw in terms of features; e.g. if you add new messages to
the test directory you will have to reclassify all of them.  Sorry.

Regarding Test Data...

I haven't included any, but I intend to.  These scripts'
usefulness depends on having one or more largish public
data sets.  I think a good source for those is unmoderated
public mailing lists that get spammed.  That avoids the
problem of including personal emails in public test data.

It is very convenient to populate a test directory using
the programs formail and procmail.  I will try to add generic
sample scripts to demonstrate this in the future.
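
Until those samples exist, here is one possible stand-in: a
Python sketch that uses the standard mailbox module (not formail
or procmail) to split an mbox archive into one file per message.
The function name explode_mbox and the msg.N numbering scheme
are assumptions; the scripts only require that the files match
msg.*.

```python
import mailbox
from pathlib import Path


def explode_mbox(mbox_path: str, out_dir: str) -> int:
    """Write each message of an mbox file to out_dir/msg.N; return the count.

    Assumption: numbering starts at 1 and follows mbox order.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    box = mailbox.mbox(mbox_path, create=False)  # do not create a missing file
    count = 0
    try:
        for count, message in enumerate(box, start=1):
            # as_bytes() omits the mbox "From " separator line by default.
            (out / f"msg.{count}").write_bytes(message.as_bytes())
    finally:
        box.close()
    return count
```

Usage would be something like explode_mbox("sensors.mbox",
"testdir"), where both names are placeholders.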

I ran bogotest-perf.pl against my personal archives of the
linux sensors mailing list, which is unmoderated.  Here's a
peek at the final statistics (bogofilter-0.7.3):

--------------------
Correct:        328 of 341 ( 96.2%)
Missed Spam:     13 of 143 (  9.1%)
False Positive:   0
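
For anyone checking the arithmetic, the figures above follow
from four counts.  The Python sketch below is hypothetical (it
is not the reporting code in bogotest-perf.pl) and assumes that
every message that is neither a missed spam nor a false positive
counts as correct.

```python
def summarize(total: int, spam_total: int,
              missed_spam: int, false_positive: int) -> str:
    """Format final statistics from raw counts.

    Assumption: correct = total - missed spam - false positives.
    """
    correct = total - missed_spam - false_positive
    return "\n".join([
        "Correct:        %3d of %3d (%5.1f%%)"
        % (correct, total, 100.0 * correct / total),
        "Missed Spam:    %3d of %3d (%5.1f%%)"
        % (missed_spam, spam_total, 100.0 * missed_spam / spam_total),
        "False Positive: %3d" % false_positive,
    ])


print(summarize(total=341, spam_total=143, missed_spam=13, false_positive=0))
```

With 341 messages, 143 of them spam, 13 missed and no false
positives, that gives 328 correct (96.2%) and a 9.1% miss rate
on spam.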

Regards,

-- 
Mark M. Hoffman
mhoffman at lightlink.com


