Comparing Graham's and Robinson's calculation methods (LONG)
Greg Louis
glouis at dynamicro.on.ca
Sun Nov 3 19:07:09 CET 2002
An html version of the following paper may be found at
http://www.bgl.nu/~glouis/bogofilter/test6000.html
Greg Louis
=============================================================
Testing bogofilter's calculation methods
Introduction and general description:
""""""""""""""""""""""""""""""""""""
The original version of bogofilter uses the computation method
presented in Paul Graham's paper A Plan for Spam
(http://www.paulgraham.com/spam.html). Gary Robinson took an interest
in Graham's paper, and wrote an insightful commentary
(http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html)
in which he presented several untested suggestions for improvements to
Graham's method. I was intrigued, and modified bogofilter 0.7 (and
subsequently 0.7.4 and 0.7.5) to try them out. Initial tests
(http://www.bgl.nu/~glouis/bogofilter) looked promising. Discussion
with David Relson (the developer who integrated my modifications into
bogofilter 0.7.6, and to whom I'm also grateful for reviewing a draft
of this paper) led to agreement that further testing would be
worthwhile; we were interested in answering (more completely) the
following two questions:
1. Does the Robinson method of calculating "spamicity" give better
   results than the original method proposed by Paul Graham?
2. What is the effect on discrimination if a test set is added to the
   training set and then re-evaluated? Can the learning effect be seen?
   Is there a significant difference if the test set is manually
   classified, or does automatically updating the training set based on
   bogofilter's own decision suffice to ensure continuing good results?
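As background, the two combining rules under comparison can be sketched as
follows. This is a simplified illustration based on my reading of the two
papers cited above, not bogofilter's actual code; token-probability
estimation and Graham's various special cases are omitted.

```python
from math import prod

def graham_combine(probs, n=15):
    """Bayes-rule combination of the n token probabilities farthest
    from 0.5 (Graham's 'most interesting' tokens)."""
    top = sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:n]
    num = prod(top)
    return num / (num + prod(1 - p for p in top))

def robinson_combine(probs):
    """Geometric-mean combination over all tokens (Robinson's
    suggestion): S near 1 means spam, near 0 means nonspam."""
    n = len(probs)
    p = 1 - prod(1 - x for x in probs) ** (1 / n)   # spam indicator
    q = 1 - prod(probs) ** (1 / n)                  # nonspam indicator
    return (1 + (p - q) / (p + q)) / 2
```

Note how Graham's rule is driven entirely by the extreme tokens, while
Robinson's statistic moves smoothly with every token probability; that
difference is at the heart of the comparison below.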
It seemed to me that something like the following experimental design
would give us a chance at getting the answers:
1. Set up initial training databases for Graham, Robinson and
supervised training; these will be identical at the outset, and
will be based on previous supervised training.
2. Accumulate messages and classify by hand.
3. Create a data frame of
message-ordinal is-spam supervised-Graham-says \
supervised-Robinson-says unsupervised-Graham-says \
unsupervised-Robinson-says
4. Calculate percentage correct for Graham and Robinson, supervised
and unsupervised.
5. Add message group to training set based on training criterion:
supervised - use is-spam
unsupervised - update Graham database from
unsupervised-Graham-says, and Robinson database from
unsupervised-Robinson-says
6. Repeat steps 3 and 4 to see if there is a learning effect.
7. Repeat steps 2-6 until four rounds have been completed.
Upon reflection, it seemed worthwhile to add replication to this
procedure, so I adopted the following changes:
o As step 1, accumulate approximately 6000 messages and classify by
  hand; then divide the spam and nonspam corpora into four groups of
  three files each, one group per round. In each round, carry out
  steps 3-4 for each third separately (call these runs 0-2 for the
  round).
o As step 2, set up initial training databases for Graham, Robinson
and supervised training; these will be identical at the outset of
each round, and will be based on supervised training. Start each
round with step 2; for round 2, use the training sets
produced by supervised training in round 1; in round 3, those
produced in round 2; and in round 4, those from round 3. A total
of four rounds is to be run.
The analysis therefore involves the following factors:
1. Size of the training database (expressed as the number of spam
messages used to build it) at the start of the round (before
training). This increases from round to round.
2. Training on the test messages (before training, after supervised
training, and after unsupervised training).
3. Method of classification (Graham and Robinson).
4. Run (which of the three message files is being processed). It's
hoped that this will not be a significant factor, so we can treat
runs as replication.
We want to know whether any or all of these factors significantly
affect the performance of bogofilter in classifying the test messages.
It would seem reasonable to use the rate of errors (number of false
negatives plus number of false positives, divided by total number of
messages) as the performance index. However, one could also argue for
computing an index that is the percentage of false negatives plus the
percentage of false positives (with a "penalty" multiplier since false
positives are far more undesirable than false negatives). In the
interest of simplicity and a closer approximation to normal
distribution, I chose to use the error rate.
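The two candidate indices can be expressed concretely as follows (a small
illustration; the penalty multiplier is an arbitrary example value, not
one used in this experiment):

```python
def error_rate(fn, fp, total):
    """Index used in this paper: overall proportion misclassified."""
    return (fn + fp) / total

def weighted_index(fn, fp, nspam, nham, penalty=10):
    """Alternative index: %false-neg plus penalized %false-pos.
    The penalty of 10 is an arbitrary example, not a paper value."""
    return 100.0 * fn / nspam + penalty * 100.0 * fp / nham

# First run of round 1 (Graham, supervised db, before training):
# 25 false negatives, 11 false positives, 509 messages (133 spam)
print(error_rate(25, 11, 509))   # ≈ 0.0707, as in the Results section
```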
The error rate is likely to vary among runs simply because of random
variation in the nature of the messages making up the spam and
nonspam test corpora. It can be assumed that this variation will be
distributed normally, or at any rate sufficiently near to normally [1]
that the statistical technique known as analysis of variance (anova)
should give valid results. The error rates from the experiment were
therefore subjected to a factorial anova. Details appear in later
sections; readers not interested in the nitty-gritty may prefer to look
at the graph in file bogoGR.png and then skip to the Discussion. Here
beginneth that which is not for the faint of heart ;-)
[1] See Cochran, W. G. and Cox, G. M.: Experimental Designs, §3.9
    (New York: John Wiley and Sons, 1950), for discussion of this
    point. An analysis based on an inverse sine transformation
    (not shown) gave results that were nearly identical to those
    produced from the untransformed data.
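The inverse sine (angular) transformation mentioned in the footnote is
simply asin(sqrt(p)); it stretches the ends of the [0,1] scale, where the
binomial variance of an observed proportion is smallest, so the anova's
equal-variance assumption holds better. A minimal illustration (not part
of the experiment's scripts):

```python
from math import asin, sqrt, pi

def angular(p):
    """Variance-stabilizing arcsine (angular) transform of a proportion."""
    return asin(sqrt(p))

# 0 -> 0, 0.5 -> pi/4, 1 -> pi/2: equal steps in p near 0 or 1 map to
# larger steps on the transformed scale than steps near 0.5 do.
assert abs(angular(0.5) - pi / 4) < 1e-12
```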
Procedure:
"""""""""
1. Initially, 6106 emails were harvested from an active email server
with approximately 100 users. These messages were separated with
bogofilter into two mailboxes, one containing mails evaluated as
nonspam and the other containing the spam. The mailboxes were
reviewed by a human observer (me) and all classification errors
were corrected by transferring the affected emails to the right
mailbox. This process yielded 4512 nonspam messages and 1594
spams.
2. Program formail was used to distribute the messages:
FILENO=0 formail -s twelfths <corpus.good
This produced a set of twelve files which were renamed thus:
cgc1-0 cgc1-1 cgc1-2 cgc2-0 cgc2-1 cgc2-2
cgc3-0 cgc3-1 cgc3-2 cgc4-0 cgc4-1 cgc4-2
Similarly,
FILENO=0 formail -s twelfths <corpus.bad
produced files that were renamed cbc[1-4]-[0-2] analogously.
The "twelfths" script merely contained
#! /bin/sh
n=$((FILENO % 12))
fname=cgx-$n
cat >>$fname
so the effect was to deal out the messages into the twelve
files as if we were dealing a deck of cards. The
naming convention was based on b for bad and g for good, with
the number before the dash representing the round and the
number after the dash representing the run.
For use in training (see below) the spam and nonspam messages
of each round were pooled:
for n in 1 2 3 4; do
cat cgc$n-? >t.good.$n
cat cbc$n-? >t.bad.$n
done
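For readers who prefer it spelled out, the effect of the twelfths script
can be sketched in Python (an equivalent illustration, not the code
actually used):

```python
def deal(messages, nfiles=12):
    """Deal messages round-robin into nfiles piles, like a deck of cards."""
    piles = [[] for _ in range(nfiles)]
    for i, msg in enumerate(messages):
        piles[i % nfiles].append(msg)
    return piles

piles = deal(list(range(26)), 12)
# 26 items into 12 piles: piles 0 and 1 get 3 items, the rest get 2;
# pile 0 holds items 0, 12 and 24, just as FILENO % 12 would place them.
```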
3. Three BOGODIR directories, named .bogovis, .bogorob and .bogogra,
were each populated with the same bogofilter training set. The
goodlist.db file had 5129 messages and the spamlist.db file had
4840. The experiment's first training round noticeably
improved performance; as the original training set was accumulated
on a different system from that on which the experiment was run,
this improvement was probably due to the messages in round 1 being
more typical of those seen in later rounds.
4. A script called maketable was used to build the R data frame
mentioned in step 3 of the Introduction, and to obtain a summary
printout. Unless it's desired to repeat the experiment, the reader
needn't study these scripts in detail; suffice it to say that the
output resembles the following:
Results for test series 25254, 509 messages including 133 spam:
Gra(s) fneg fpos Rob(s) fneg fpos Gra(u) fneg fpos Rob(u) fneg fpos
n 36 25 11 24 23 1 36 25 11 24 23 1
%0.0707 18.797 2.926 0.0472 17.293 0.266 0.0707 18.797 2.926 0.0472 17.293 0.266
The error rate appears under Gra() or Rob(); note that it's not
expressed as a percentage in the "%" line, but as an absolute
proportion. Then come false negatives and false positives. Letter
(s) indicates supervised training, while letter (u) means
unsupervised, aka automated or automatic. The maketable script follows:
#! /bin/bash
# make individual vectors
cd ~/bin
>../test/visg.$$ ; >../test/visr.$$
>../test/gra.$$ ; >../test/rob.$$
>../test/isspam.$$ ; >../test/indices.$$
indx=1; spam=Y
for corp in cb cg; do
formail -s bogovis -v -g <../mail/${corp}$1 | cut -c 13 >>../test/visg.$$ &
formail -s bogovis -v -r <../mail/${corp}$1 | cut -c 13 >>../test/visr.$$ &
formail -s bogogra -v -g <../mail/${corp}$1 | cut -c 13 >>../test/gra.$$ &
formail -s bogorob -v -r <../mail/${corp}$1 | cut -c 13 >>../test/rob.$$ &
wait
msgcount=`wc -l ../test/visg.$$ | awk '{print $1}'`
for n in `seq $indx $msgcount`; do
echo $n >>../test/indices.$$
echo $spam >>../test/isspam.$$
done
let indx=${msgcount}+1; spam="N"
done
# combine into an R data frame
cd ../test
echo " spam sG sR uG uR" >table.$$
pr -m -t -l 9999 -F indices.$$ isspam.$$ visg.$$ visr.$$ gra.$$ rob.$$ | \
awk '{printf("%4s%4s%4s%4s%4s%4s\n", $1, $2, $3, $4, $5, $6)}' | \
tr YN 01 >>table.$$
# expunge unneeded vector files
/bin/rm ../test/[!t]*.$$
# calculate percentage correctness
reportpercents $$
The reportpercents script looks like this:
#! /usr/bin/perl
$ext = $ARGV[0];
open(TBL, "table.$ext") || die("couldn't open table.$ext");
<TBL>;
$sumsupgra = $sumsuprob = $sumunsgra = $sumunsrob = $n = $ns = 0;
$fnsupgra = $fpsupgra = $fnsuprob = $fpsuprob = 0;
$fnunsgra = $fpunsgra = $fnunsrob = $fpunsrob = 0;
while (<TBL>) {
($indx, $spam, $supgra, $suprob, $unsgra, $unsrob) = split;
$n++; $ns++ if($spam == 0);  # spam is coded 0 (Y), nonspam 1 (N)
$sumsupgra++ if($supgra != $spam);
$sumsuprob++ if($suprob != $spam);
$sumunsgra++ if($unsgra != $spam);
$sumunsrob++ if($unsrob != $spam);
$fnsupgra++ if($supgra > $spam);
$fpsupgra++ if($supgra < $spam);
$fnsuprob++ if($suprob > $spam);
$fpsuprob++ if($suprob < $spam);
$fnunsgra++ if($unsgra > $spam);
$fpunsgra++ if($unsgra < $spam);
$fnunsrob++ if($unsrob > $spam);
$fpunsrob++ if($unsrob < $spam);
}
close TBL;
printf("Results for test series $ext, %d messages including %d spam:\n",$n,$ns);
print " Gra(s) fneg fpos Rob(s) fneg fpos Gra(u) fneg fpos Rob(u) fneg fpos\n";
printf("n%6d%7d%6d%7d%7d%6d%7d%7d%6d%7d%7d%6d\n",
$sumsupgra, $fnsupgra, $fpsupgra, $sumsuprob, $fnsuprob, $fpsuprob,
$sumunsgra, $fnunsgra, $fpunsgra, $sumunsrob, $fnunsrob, $fpunsrob);
printf("%%%6.4f%7.3f%6.3f%7.4f%7.3f%6.3f%7.4f%7.3f%6.3f%7.4f%7.3f%6.3f\n",
$sumsupgra / $n, 100 * $fnsupgra / $ns, 100 * $fpsupgra / ($n-$ns),
$sumsuprob / $n, 100 * $fnsuprob / $ns, 100 * $fpsuprob / ($n-$ns),
$sumunsgra / $n, 100 * $fnunsgra / $ns, 100 * $fpunsgra / ($n-$ns),
$sumunsrob / $n, 100 * $fnunsrob / $ns, 100 * $fpunsrob / ($n-$ns));
For training within rounds, a script called train was invoked:
#! /bin/bash
if [ "x$1" = "x" ]; then
echo "Usage: train n (n is the round number)"
exit 1
else
echo "Training on round $1:"
fi
echo -n "Supervised training, bad... "
formail -s bogovis -s -r <~/mail/t.bad.$1
echo done
echo -n "Supervised training, good... "
formail -s bogovis -n -r <~/mail/t.good.$1
echo done
echo -n "Automated training, Robinson, bad... "
formail -s bogorob -u -r <~/mail/t.bad.$1
echo done
echo -n "Automated training, Robinson, good... "
formail -s bogorob -u -r <~/mail/t.good.$1
echo done
echo -n "Automated training, Graham, bad... "
formail -s bogogra -u <~/mail/t.bad.$1
echo done
echo -n "Automated training, Graham, good... "
formail -s bogogra -u <~/mail/t.good.$1
echo "done, copying... "
for dir in ~/.bogovis ~/.bogorob ~/.bogogra; do
for file in goodlist.db spamlist.db; do
cp $dir/$file $dir/$file.$1
done
done
echo done
echo "Training complete."
In the above, the commands bogovis, bogorob and bogogra performed the
equivalent of "bogofilter -d ~/.bogovis", "bogofilter -d ~/.bogorob"
and "bogofilter -d ~/.bogogra" respectively.
5. For each round of the experiment, these tools were applied as follows:
Let the number of the round be represented by X (1, 2, 3 or 4); then
for table in cX-0 cX-1 cX-2; do maketable $table; done
train X
for table in cX-0 cX-1 cX-2; do maketable $table; done
cp ~/.bogovis/*.db ~/.bogorob
cp ~/.bogovis/*.db ~/.bogogra
The last two commands ensure that each round starts from the
training set produced with supervision; we did not want training
errors produced by automated training to accumulate. As the
results appeared, they were entered (cut and pasted) into two
files, one showing the reportpercent script's output, the other an
R data frame with the error rates (the format of the R data frame
appears in the Results section below).
6. The data frame mentioned in the preceding paragraph served as input
to the following R script:
if(length(grep("^outfile$",ls(),value=TRUE)) > 0) sink(outfile)
read.table("bogoGR.tab") -> bogo
bogo$Spamlist <- factor(bogo$Spamlist)
bogo$Training <- factor(bogo$Training)
bogo$Method <- factor(bogo$Method)
bogo$Run <- factor(bogo$Run)
attach(bogo)
print(bogo)
bogaov <- aov(Error ~ Spamlist + Training + Method + Run, data=bogo)
print(summary(bogaov))
replaov <- aov(Error ~ Spamlist + Training + Method + Spamlist * Training
+ Spamlist * Method + Training * Method, data=bogo)
print(summary(replaov))
print(TukeyHSD(replaov),digits=4)
d <- c(1.95996, 0.412, 0.423)
rn <- length(residuals(replaov))
rdf <- rn - rn/6 - 6
rms <- deviance(replaov) / rdf
n <- 3
errors <- array(bogo$Error, dim=c(3,rn/3))
meanerr <- apply(errors, 2, mean)
z <- (d[1] + 1 / (rdf * d[2] - d[3])) * sqrt(rms / n)
lcl <- pmax(0,meanerr - z)
ucl <- meanerr + z
MeanErr <- round(meanerr,digits=5)
LCL <- round(lcl,digits=5)
UCL <- round(ucl,digits=5)
lbls <- c("G1","R1","Gs1","Rs1","Ga1","Ra1")
if(length(lbls)<rn/3) lbls <- c(lbls,"G2","R2","Gs2","Rs2","Ga2","Ra2")
if(length(lbls)<rn/3) lbls <- c(lbls,"G3","R3","Gs3","Rs3","Ga3","Ra3")
if(length(lbls)<rn/3) lbls <- c(lbls,"G4","R4","Gs4","Rs4","Ga4","Ra4")
if(length(lbls)<rn/3) lbls <- c(lbls,"G5","R5","Gs5","Rs5","Ga5","Ra5")
data.frame(MeanErr,LCL,UCL,row.names=lbls) -> mcl
print(mcl)
plot(meanerr,ylim=c(0,0.1),ylab="Error Rate",xlab="",axes=FALSE)
axis(2)
axis(1,at=1:(rn/3),labels=lbls)
points(ucl,pch="-")
points(lcl,pch="-")
lines(ucl,type="h")
lines(lcl,type="h",col="white")
text(rn*13/54,0.097,sprintf("Bogofilter test, rounds 1-%0.0f",rn/18))
sink()
This R script (called bogaov.R) was invoked like this:
outfile <- "bogoGR.sess"
source("bogaov.R", echo=TRUE)
7. The plot produced by bogaov.R was saved:
import bogoGR.png
(import is the ImageMagick screen-capture utility; the plot window
was selected by clicking on it with the mouse)
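As an aside, the vector d in the R script above appears to implement a
closed-form approximation to the two-sided 95% Student-t critical value:
1.95996 is the normal quantile, and the 1/(rdf*0.412 - 0.423) term is a
small-sample correction that vanishes as the residual degrees of freedom
grow. A quick check of that reading (Python; the interpretation is mine,
not stated in the script):

```python
def t975_approx(df):
    """Approximate two-sided 95% Student-t critical value:
    normal quantile plus a small-sample correction."""
    return 1.95996 + 1.0 / (df * 0.412 - 0.423)

print(t975_approx(54))    # ≈ 2.0058 for this experiment's 54 residual df
print(t975_approx(1e6))   # essentially the normal quantile 1.95996
```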
Results:
"""""""
1. Here is the output of the reportpercents runs:
Combined corpora: bogofilter test
Round 1
Before training:
$ for table in c1-0 c1-1 c1-2; do maketable $table; done
Results for test series 25254, 509 messages including 133 spam:
Gra(s) fneg fpos Rob(s) fneg fpos Gra(u) fneg fpos Rob(u) fneg fpos
n 36 25 11 24 23 1 36 25 11 24 23 1
%0.0707 18.797 2.926 0.0472 17.293 0.266 0.0707 18.797 2.926 0.0472 17.293 0.266
Results for test series 27329, 509 messages including 133 spam:
Gra(s) fneg fpos Rob(s) fneg fpos Gra(u) fneg fpos Rob(u) fneg fpos
n 37 25 12 25 24 1 37 25 12 25 24 1
%0.0727 18.797 3.191 0.0491 18.045 0.266 0.0727 18.797 3.191 0.0491 18.045 0.266
Results for test series 29411, 509 messages including 133 spam:
Gra(s) fneg fpos Rob(s) fneg fpos Gra(u) fneg fpos Rob(u) fneg fpos
n 37 29 8 32 31 1 37 29 8 32 31 1
%0.0727 21.805 2.128 0.0629 23.308 0.266 0.0727 21.805 2.128 0.0629 23.308 0.266
After training:
$ for table in c1-0 c1-1 c1-2; do maketable $table; done
Results for test series 3892, 509 messages including 133 spam:
Gra(s) fneg fpos Rob(s) fneg fpos Gra(u) fneg fpos Rob(u) fneg fpos
n 21 21 0 1 1 0 29 18 11 15 13 2
%0.0413 15.789 0.000 0.0020 0.752 0.000 0.0570 13.534 2.926 0.0295 9.774 0.532
Results for test series 5961, 509 messages including 133 spam:
Gra(s) fneg fpos Rob(s) fneg fpos Gra(u) fneg fpos Rob(u) fneg fpos
n 20 20 0 3 3 0 35 22 13 21 18 3
%0.0393 15.038 0.000 0.0059 2.256 0.000 0.0688 16.541 3.457 0.0413 13.534 0.798
Results for test series 8030, 509 messages including 133 spam:
Gra(s) fneg fpos Rob(s) fneg fpos Gra(u) fneg fpos Rob(u) fneg fpos
n 29 29 0 1 1 0 33 25 8 18 16 2
%0.0570 21.805 0.000 0.0020 0.752 0.000 0.0648 18.797 2.128 0.0354 12.030 0.532
Round 2
Before training:
$ for table in c2-0 c2-1 c2-2; do maketable $table; done
Results for test series 16426, 509 messages including 133 spam:
Gra(s) fneg fpos Rob(s) fneg fpos Gra(u) fneg fpos Rob(u) fneg fpos
n 27 24 3 15 14 1 27 24 3 15 14 1
%0.0530 18.045 0.798 0.0295 10.526 0.266 0.0530 18.045 0.798 0.0295 10.526 0.266
Results for test series 18494, 509 messages including 133 spam:
Gra(s) fneg fpos Rob(s) fneg fpos Gra(u) fneg fpos Rob(u) fneg fpos
n 22 20 2 13 12 1 22 20 2 13 12 1
%0.0432 15.038 0.532 0.0255 9.023 0.266 0.0432 15.038 0.532 0.0255 9.023 0.266
Results for test series 20563, 509 messages including 133 spam:
Gra(s) fneg fpos Rob(s) fneg fpos Gra(u) fneg fpos Rob(u) fneg fpos
n 27 22 5 22 21 1 27 22 5 22 21 1
%0.0530 16.541 1.330 0.0432 15.789 0.266 0.0530 16.541 1.330 0.0432 15.789 0.266
After training:
Results for test series 27432, 509 messages including 133 spam:
Gra(s) fneg fpos Rob(s) fneg fpos Gra(u) fneg fpos Rob(u) fneg fpos
n 20 20 0 2 2 0 27 24 3 12 9 3
%0.0393 15.038 0.000 0.0039 1.504 0.000 0.0530 18.045 0.798 0.0236 6.767 0.798
Results for test series 29501, 509 messages including 133 spam:
Gra(s) fneg fpos Rob(s) fneg fpos Gra(u) fneg fpos Rob(u) fneg fpos
n 13 13 0 1 1 0 21 18 3 12 11 1
%0.0255 9.774 0.000 0.0020 0.752 0.000 0.0413 13.534 0.798 0.0236 8.271 0.266
Results for test series 31569, 509 messages including 133 spam:
Gra(s) fneg fpos Rob(s) fneg fpos Gra(u) fneg fpos Rob(u) fneg fpos
n 17 17 0 3 3 0 23 18 5 21 17 4
%0.0334 12.782 0.000 0.0059 2.256 0.000 0.0452 13.534 1.330 0.0413 12.782 1.064
Round 3
Before training:
$ for table in c3-0 c3-1 c3-2; do maketable $table; done
Results for test series 1192, 509 messages including 133 spam:
Gra(s) fneg fpos Rob(s) fneg fpos Gra(u) fneg fpos Rob(u) fneg fpos
n 26 24 2 15 14 1 26 24 2 15 14 1
%0.0511 18.045 0.532 0.0295 10.526 0.266 0.0511 18.045 0.532 0.0295 10.526 0.266
Results for test series 3261, 509 messages including 133 spam:
Gra(s) fneg fpos Rob(s) fneg fpos Gra(u) fneg fpos Rob(u) fneg fpos
n 34 31 3 24 24 0 34 31 3 24 24 0
%0.0668 23.308 0.798 0.0472 18.045 0.000 0.0668 23.308 0.798 0.0472 18.045 0.000
Results for test series 5330, 509 messages including 133 spam:
Gra(s) fneg fpos Rob(s) fneg fpos Gra(u) fneg fpos Rob(u) fneg fpos
n 30 27 3 24 22 2 30 27 3 24 22 2
%0.0589 20.301 0.798 0.0472 16.541 0.532 0.0589 20.301 0.798 0.0472 16.541 0.532
After training:
$ for table in c3-0 c3-1 c3-2; do maketable $table; done
Results for test series 12183, 509 messages including 133 spam:
Gra(s) fneg fpos Rob(s) fneg fpos Gra(u) fneg fpos Rob(u) fneg fpos
n 16 16 0 1 1 0 24 22 2 13 12 1
%0.0314 12.030 0.000 0.0020 0.752 0.000 0.0472 16.541 0.532 0.0255 9.023 0.266
Results for test series 14264, 509 messages including 133 spam:
Gra(s) fneg fpos Rob(s) fneg fpos Gra(u) fneg fpos Rob(u) fneg fpos
n 22 22 0 2 2 0 31 28 3 22 21 1
%0.0432 16.541 0.000 0.0039 1.504 0.000 0.0609 21.053 0.798 0.0432 15.789 0.266
Results for test series 16333, 509 messages including 133 spam:
Gra(s) fneg fpos Rob(s) fneg fpos Gra(u) fneg fpos Rob(u) fneg fpos
n 24 24 0 5 5 0 27 24 3 23 20 3
%0.0472 18.045 0.000 0.0098 3.759 0.000 0.0530 18.045 0.798 0.0452 15.038 0.798
Round 4
Before training:
$ for table in c4-0 c4-1 c4-2; do maketable $table; done
Results for test series 18451, 509 messages including 133 spam:
Gra(s) fneg fpos Rob(s) fneg fpos Gra(u) fneg fpos Rob(u) fneg fpos
n 29 25 4 18 16 2 29 25 4 18 16 2
%0.0570 18.797 1.064 0.0354 12.030 0.532 0.0570 18.797 1.064 0.0354 12.030 0.532
Results for test series 20522, 508 messages including 132 spam:
Gra(s) fneg fpos Rob(s) fneg fpos Gra(u) fneg fpos Rob(u) fneg fpos
n 26 21 5 19 17 2 26 21 5 19 17 2
%0.0512 15.909 1.330 0.0374 12.879 0.532 0.0512 15.909 1.330 0.0374 12.879 0.532
Results for test series 22587, 508 messages including 132 spam:
Gra(s) fneg fpos Rob(s) fneg fpos Gra(u) fneg fpos Rob(u) fneg fpos
n 27 23 4 21 19 2 27 23 4 21 19 2
%0.0531 17.424 1.064 0.0413 14.394 0.532 0.0531 17.424 1.064 0.0413 14.394 0.532
After training:
$ for table in c4-0 c4-1 c4-2; do maketable $table; done
Results for test series 29488, 509 messages including 133 spam:
Gra(s) fneg fpos Rob(s) fneg fpos Gra(u) fneg fpos Rob(u) fneg fpos
n 20 20 0 6 5 1 25 21 4 16 14 2
%0.0393 15.038 0.000 0.0118 3.759 0.266 0.0491 15.789 1.064 0.0314 10.526 0.532
Results for test series 31574, 508 messages including 132 spam:
Gra(s) fneg fpos Rob(s) fneg fpos Gra(u) fneg fpos Rob(u) fneg fpos
n 17 16 1 4 3 1 25 20 5 18 15 3
%0.0335 12.121 0.266 0.0079 2.273 0.266 0.0492 15.152 1.330 0.0354 11.364 0.798
Results for test series 1172, 508 messages including 132 spam:
Gra(s) fneg fpos Rob(s) fneg fpos Gra(u) fneg fpos Rob(u) fneg fpos
n 20 19 1 5 4 1 25 21 4 22 19 3
%0.0394 14.394 0.266 0.0098 3.030 0.266 0.0492 15.909 1.064 0.0433 14.394 0.798
2. Here (slightly abbreviated) are the contents of the R session:
> bogo <- read.table("bogoGR.tab")
> bogo$Spamlist <- factor(bogo$Spamlist)
> bogo$Training <- factor(bogo$Training)
> bogo$Method <- factor(bogo$Method)
> bogo$Run <- factor(bogo$Run)
> attach(bogo)
> print(bogo)
Spamlist Training Method Run Error Spamlist Training Method Run Error
1 4840 Pre Graham 1 0.0707 37 5638 Pre Graham 1 0.0511
2 4840 Pre Graham 2 0.0727 38 5638 Pre Graham 2 0.0668
3 4840 Pre Graham 3 0.0727 39 5638 Pre Graham 3 0.0589
4 4840 Pre Robinson 1 0.0472 40 5638 Pre Robinson 1 0.0295
5 4840 Pre Robinson 2 0.0491 41 5638 Pre Robinson 2 0.0472
6 4840 Pre Robinson 3 0.0629 42 5638 Pre Robinson 3 0.0472
7 4840 Supervised Graham 1 0.0413 43 5638 Supervised Graham 1 0.0314
8 4840 Supervised Graham 2 0.0393 44 5638 Supervised Graham 2 0.0432
9 4840 Supervised Graham 3 0.0570 45 5638 Supervised Graham 3 0.0472
10 4840 Supervised Robinson 1 0.0020 46 5638 Supervised Robinson 1 0.0020
11 4840 Supervised Robinson 2 0.0059 47 5638 Supervised Robinson 2 0.0039
12 4840 Supervised Robinson 3 0.0020 48 5638 Supervised Robinson 3 0.0098
13 4840 Automated Graham 1 0.0570 49 5638 Automated Graham 1 0.0472
14 4840 Automated Graham 2 0.0688 50 5638 Automated Graham 2 0.0609
15 4840 Automated Graham 3 0.0648 51 5638 Automated Graham 3 0.0530
16 4840 Automated Robinson 1 0.0295 52 5638 Automated Robinson 1 0.0255
17 4840 Automated Robinson 2 0.0413 53 5638 Automated Robinson 2 0.0432
18 4840 Automated Robinson 3 0.0354 54 5638 Automated Robinson 3 0.0452
19 5239 Pre Graham 1 0.0530 55 6037 Pre Graham 1 0.0570
20 5239 Pre Graham 2 0.0432 56 6037 Pre Graham 2 0.0512
21 5239 Pre Graham 3 0.0530 57 6037 Pre Graham 3 0.0531
22 5239 Pre Robinson 1 0.0295 58 6037 Pre Robinson 1 0.0354
23 5239 Pre Robinson 2 0.0255 59 6037 Pre Robinson 2 0.0374
24 5239 Pre Robinson 3 0.0432 60 6037 Pre Robinson 3 0.0413
25 5239 Supervised Graham 1 0.0393 61 6037 Supervised Graham 1 0.0393
26 5239 Supervised Graham 2 0.0255 62 6037 Supervised Graham 2 0.0335
27 5239 Supervised Graham 3 0.0334 63 6037 Supervised Graham 3 0.0394
28 5239 Supervised Robinson 1 0.0039 64 6037 Supervised Robinson 1 0.0118
29 5239 Supervised Robinson 2 0.0020 65 6037 Supervised Robinson 2 0.0079
30 5239 Supervised Robinson 3 0.0059 66 6037 Supervised Robinson 3 0.0098
31 5239 Automated Graham 1 0.0530 67 6037 Automated Graham 1 0.0491
32 5239 Automated Graham 2 0.0413 68 6037 Automated Graham 2 0.0492
33 5239 Automated Graham 3 0.0452 69 6037 Automated Graham 3 0.0492
34 5239 Automated Robinson 1 0.0296 70 6037 Automated Robinson 1 0.0314
35 5239 Automated Robinson 2 0.0236 71 6037 Automated Robinson 2 0.0354
36 5239 Automated Robinson 3 0.0413 72 6037 Automated Robinson 3 0.0433
> bogaov <- aov(Error ~ Spamlist + Training + Method +
Run, data = bogo)
> print(summary(bogaov))
Df Sum Sq Mean Sq F value Pr(>F)
Spamlist 3 0.0014951 0.0004984 8.8808 5.412e-05 ***
Training 2 0.0101961 0.0050981 90.8453 < 2.2e-16 ***
Method 1 0.0094508 0.0094508 168.4094 < 2.2e-16 ***
Run 2 0.0004673 0.0002336 4.1631 0.02004 *
Residuals 63 0.0035354 0.0000561
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
> replaov <- aov(Error ~ Spamlist + Training + Method +
Spamlist * Training + Spamlist * Method + Training * Method,
data = bogo)
> print(summary(replaov))
Df Sum Sq Mean Sq F value Pr(>F)
Spamlist 3 0.0014951 0.0004984 12.5747 2.396e-06 ***
Training 2 0.0101961 0.0050981 128.6322 < 2.2e-16 ***
Method 1 0.0094508 0.0094508 238.4589 < 2.2e-16 ***
Spamlist:Training 6 0.0005042 0.0000840 2.1203 0.06577 .
Spamlist:Method 3 0.0003346 0.0001115 2.8145 0.04778 *
Training:Method 2 0.0010237 0.0005118 12.9145 2.608e-05 ***
Residuals 54 0.0021402 0.0000396
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
> print(TukeyHSD(replaov), digits = 4)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = Error ~ Spamlist + Training + Method + Spamlist *
Training + Spamlist * Method + Training * Method, data = bogo)
$Spamlist
diff lwr upr
5239-4840 -0.012678 -0.018241 -0.0071149
5638-4840 -0.005911 -0.011474 -0.0003483
6037-4840 -0.008050 -0.013613 -0.0024872
5638-5239 0.006767 0.001204 0.0123295
6037-5239 0.004628 -0.000935 0.0101906
6037-5638 -0.002139 -0.007702 0.0034239
$Training
diff lwr upr
Pre-Automated 0.005642 0.001262 0.01002
Supervised-Automated -0.021946 -0.026326 -0.01757
Supervised-Pre -0.027587 -0.031967 -0.02321
$Method
diff lwr upr
Robinson-Graham -0.02291 -0.02589 -0.01994
(Interaction printout elided)
> d <- c(1.95996, 0.412, 0.423)
> rn <- length(residuals(replaov))
> rdf <- rn - rn/6 - 6
> rms <- deviance(replaov)/rdf
> n <- 3
> errors <- array(bogo$Error, dim = c(3, rn/3))
> meanerr <- apply(errors, 2, mean)
> z <- (d[1] + 1/(rdf * d[2] - d[3])) * sqrt(rms/n)
> lcl <- pmax(0, meanerr - z)
> ucl <- meanerr + z
> MeanErr <- round(meanerr, digits = 5)
> LCL <- round(lcl, digits = 5)
> UCL <- round(ucl, digits = 5)
> lbls <- c("G1", "R1", "Gs1", "Rs1", "Ga1", "Ra1")
> if (length(lbls) < rn/3) lbls <- c(lbls, "G2", "R2",
"Gs2", "Rs2", "Ga2", "Ra2")
> if (length(lbls) < rn/3) lbls <- c(lbls, "G3", "R3",
"Gs3", "Rs3", "Ga3", "Ra3")
> if (length(lbls) < rn/3) lbls <- c(lbls, "G4", "R4",
"Gs4", "Rs4", "Ga4", "Ra4")
> if (length(lbls) < rn/3) lbls <- c(lbls, "G5", "R5",
"Gs5", "Rs5", "Ga5", "Ra5")
> mcl <- data.frame(MeanErr, LCL, UCL, row.names = lbls)
> print(mcl)
MeanErr LCL UCL
G1 0.07203 0.06474 0.07932
R1 0.05307 0.04578 0.06036
Gs1 0.04587 0.03858 0.05316
Rs1 0.00330 0.00000 0.01059
Ga1 0.06353 0.05624 0.07082
Ra1 0.03540 0.02811 0.04269
G2 0.04973 0.04244 0.05702
R2 0.03273 0.02544 0.04002
Gs2 0.03273 0.02544 0.04002
Rs2 0.00393 0.00000 0.01122
Ga2 0.04650 0.03921 0.05379
Ra2 0.03150 0.02421 0.03879
G3 0.05893 0.05164 0.06622
R3 0.04130 0.03401 0.04859
Gs3 0.04060 0.03331 0.04789
Rs3 0.00523 0.00000 0.01252
Ga3 0.05370 0.04641 0.06099
Ra3 0.03797 0.03068 0.04526
G4 0.05377 0.04648 0.06106
R4 0.03803 0.03074 0.04532
Gs4 0.03740 0.03011 0.04469
Rs4 0.00983 0.00254 0.01712
Ga4 0.04917 0.04188 0.05646
Ra4 0.03670 0.02941 0.04399
> plot(meanerr, ylim = c(0, 0.08), ylab = "Error Rate",
xlab = "", axes = FALSE)
> axis(2)
> axis(1, at = 1:(rn/3), labels = lbls)
> points(ucl, pch = "-")
> points(lcl, pch = "-")
> lines(ucl, type = "h")
> lines(lcl, type = "h", col = "white")
> text(rn * 13/54, 0.078, sprintf("Bogofilter test, rounds 1-%0.0f",
rn/18))
> sink()
The accompanying file bogoGR.png contains a plot of the results. It
shows mean and 95% confidence limits of the error rate for each method
(G and R), before training and after supervised or automated training
(s and a), for each round (1-4).
Discussion:
""""""""""
The effect of factor Run (i.e. the difference caused by whether it was
the first, second or third group of messages that was being processed)
was small enough to be dismissed; the runs were treated as simple
replicates.
The observations that follow are apparent from the graph and are all
confirmed by the analysis:
1. Robinson's method of calculation had a generally lower error rate
than did Graham's. (Compare Gx with Rx where x is 1, 2, 3 or 4.)
Applying Robinson's calculation method, in this experiment,
produced a diminution of between 0.0199 and 0.0259 in the overall
error rate (those are 95% confidence limits; the mean was 0.0229)
with respect to that achieved by Graham's method. Since Graham's
error rate averaged 0.0586 in this experiment, that means that the
Robinson method performed between 34% and 44% better than did the
Graham method. This effect is highly significant, both
statistically and practically.
2. The effect of spamlist size (i.e. of training in general) was
   highly significant; the second, third and fourth rounds all showed
   lower error rates than the first. After the first round, however,
   the training effect seen from round to round was small (within the
level of random variation). This is unsurprising, as the first
round's training served to render the training database
significantly more typical of the population of messages to be
evaluated in subsequent rounds, and the training database was only
growing by about 8% per round. A general tendency to improve with
further training might exist, but this experiment was too brief to
demonstrate it. There seemed to be little interaction between
spamlist size and either of the other factors.
3. When the data for a round were added to the training set and then
re-evaluated, the effect of type of training (automated or
supervised) was also highly significant; as expected, supervised
training had a much better effect on performance than automated
training had. Letting bogofilter feed its own decisions into its
training set seems not to be an effective way to train, unless the
-S/-N options are used to correct its errors. (Personally, I
prefer to have bogofilter put what it thinks is spam in a separate
folder, which I periodically review. To that folder I also
transfer any false negatives that appear among the regular email.
Then the regular archive and the spam folder are added with -n and
-s so the training set never sees bad data -- well, hardly ever ;-)
The improvement in bogofilter's performance when the training data
were re-evaluated after supervised training was particularly
dramatic with Robinson's method of calculation, which takes into
account all the tokens in the message rather than just the most
characteristic ones. (Compare Gx with Gsx and Rx with Rsx where x
is 1, 2, 3 or 4.)
This significant training effect means that re-evaluating training
data should not be used to test bogofilter's performance, since
such tests will have no value in predicting performance in
production, where most messages are new.
4. Although generally beneficial (when errors are not allowed to
accumulate), automatic training is inferior to supervised training
in reducing the error rate when training data are re-evaluated.
This too is unsurprising, as automatic training means that any
errors made prior to training are entered into the training set
and reinforce the likelihood of making the same error again.
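The 34-44% improvement figures quoted in point 1 above follow directly
from the confidence limits and Graham's mean error rate; a quick
arithmetic check:

```python
graham_mean = 0.0586                    # Graham's average error rate
lo, mean, hi = 0.0199, 0.0229, 0.0259   # 95% limits and mean of the reduction

improvements = [round(100 * d / graham_mean) for d in (lo, mean, hi)]
print(improvements)   # [34, 39, 44] percent
```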
Conclusions:
"""""""""""
We can respond to the questions posed in the introduction as follows:
1. The results of this experiment indicate that Robinson's
method of calculation is more likely to yield correct results than
is the original method proposed by Graham.
2. If a test set of messages is added to the training set and then
re-evaluated, a significant learning effect is seen. This effect
is greater if the test set is manually corrected before addition.
[(C) Greg Louis, 2002; last modified 2002-11-03]
-------------- next part --------------
[Attachment: bogoGR.png, image/png, 6603 bytes:
http://www.bogofilter.org/pipermail/bogofilter/attachments/20021103/4ceede64/attachment.png]