Comparing Graham's and Robinson's calculation methods (LONG)

Greg Louis glouis at dynamicro.on.ca
Sun Nov 3 19:07:09 CET 2002


An html version of the following paper may be found at
http://www.bgl.nu/~glouis/bogofilter/test6000.html

Greg Louis
=============================================================

Testing bogofilter's calculation methods

Introduction and general description:
""""""""""""""""""""""""""""""""""""

The original version of bogofilter uses the computation method
presented in Paul Graham's paper A Plan for Spam
(http://www.paulgraham.com/spam.html).  Gary Robinson took an interest
in Graham's paper, and wrote an insightful commentary
(http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html)
in which he presented several untested suggestions for improvements to
Graham's method.  I was intrigued, and modified bogofilter 0.7 (and
subsequently 0.7.4 and 0.7.5) to try them out.  Initial tests
(http://www.bgl.nu/~glouis/bogofilter) looked promising.  Discussion
with David Relson (the developer who integrated my modifications into
bogofilter 0.7.6, and to whom I'm also grateful for reviewing a draft
of this paper) led to agreement that further testing would be
worthwhile; we were interested in answering (more completely) the
following two questions:

Does the Robinson method of calculating "spamicity" give better results
than the original method proposed by Paul Graham?

What is the effect on discrimination if a test set is added to the
training set and then re-evaluated?  Can the learning effect be seen? 
Is there a significant difference if the test set is manually
classified, or does automatically updating the training set based on
bogofilter's own decision suffice to ensure continuing good results?

It seemed to me that something like the following experimental design
would give us a chance at getting the answers:

1.  Set up initial training databases for Graham, Robinson and
    supervised training; these will be identical at the outset, and
    will be based on previous supervised training.
2.  Accumulate messages and classify by hand.
3.  Create a data frame with columns:
    message-ordinal, is-spam, supervised-Graham-says,
    supervised-Robinson-says, unsupervised-Graham-says,
    unsupervised-Robinson-says
4.  Calculate percentage correct for Graham and Robinson, supervised
    and unsupervised.
5.  Add message group to training set based on training criterion:
    supervised - use is-spam
    unsupervised - update Graham database from
      unsupervised-Graham-says, and Robinson database from
      unsupervised-Robinson-says
6.  Repeat steps 3 and 4 to see if there is a learning effect.
7.  Repeat steps 2-6 till 4 rounds have been completed.

Upon reflection, it seemed worthwhile to add replication to this
procedure, so I adopted the following changes:

o  As step 1, accumulate approximately 6000 messages and classify them
    by hand; then divide the spam and nonspam corpora into four groups
    of three files each, one group per round.  In each round, carry out
    steps 3-4 for each of the three files separately (call these runs
    0-2 for the round).

o  As step 2, set up initial training databases for Graham, Robinson
    and supervised training; these will be identical at the outset of
    each round, and will be based on supervised training.  Start each
    round with step 2; for round 2, use the training sets
    produced by supervised training in round 1; in round 3, those
    produced in round 2; and in round 4, those from round 3.  A total
    of four rounds is to be run.

The analysis therefore involves the following factors:

1.  Size of the training database (expressed as the number of spam
    messages used to build it) at the start of the round (before
    training).  This increases from round to round.

2.  Training on the test messages (before training, after supervised
    training, and after unsupervised training).

3.  Method of classification (Graham and Robinson).

4.  Run (which of the three message files is being processed).  It's
    hoped that this will not be a significant factor, so that the runs
    can be treated as replicates.

We want to know whether any or all of these factors significantly
affect the performance of bogofilter in classifying the test messages. 
It would seem reasonable to use the rate of errors (number of false
negatives plus number of false positives, divided by total number of
messages) as the performance index.  However, one could also argue for
computing an index that is the percentage of false negatives plus the
percentage of false positives (with a "penalty" multiplier since false
positives are far more undesirable than false negatives).  In the
interest of simplicity and a closer approximation to normal
distribution, I chose to use the error rate.
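
To make the index concrete, here is the calculation for the first test
series reported in the Results (25 false negatives and 11 false
positives among 509 messages), written as a line of R:

        fneg <- 25; fpos <- 11; total <- 509   # first test series, round 1
        (fneg + fpos) / total                  # error rate = 36/509, about 0.0707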

The error rate is likely to vary among runs, simply because of random
variation in the nature of the messages making up the spam and nonspam
test corpora.  It can be assumed that this variation will be
distributed normally, or at any rate sufficiently near to normally [1]
that the statistical technique known as analysis of variance (anova)
should give valid results.  The error rates from the experiment were
therefore subjected to a factorial anova.  Details appear in later
sections; readers not interested in the nitty-gritty may prefer to look
at the graph in file bogoGR.png and then skip to the Discussion.  Here
beginneth that which is not for the faint of heart ;-)

    [1] See Cochran, W. G. and Cox, G. M. (New York, John Wiley
    and Sons, 1950): Experimental Designs, §3.9, for discussion
    on this point.  An analysis based on an inverse sine
    transformation (not shown) gave results that were nearly
    identical to those produced from the untransformed data.
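
    For the curious, here is a minimal sketch of such a transformed
    analysis (assuming the bogoGR.tab data frame that appears in the
    Results section); its output is not reproduced here:

        bogo <- read.table("bogoGR.tab")
        for (f in c("Spamlist", "Training", "Method", "Run"))
            bogo[[f]] <- factor(bogo[[f]])
        # arcsine square-root transform, variance-stabilising for proportions
        bogo$TransErr <- asin(sqrt(bogo$Error))
        summary(aov(TransErr ~ Spamlist + Training + Method + Run, data = bogo))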

Procedure:
"""""""""
1.  Initially, 6106 emails were harvested from an active email server
    with approximately 100 users.  These messages were separated with
    bogofilter into two mailboxes, one containing mails evaluated as
    nonspam and the other containing the spam.  The mailboxes were
    reviewed by a human observer (me) and all classification errors
    were corrected by transferring the affected emails to the right
    mailbox.  This process yielded 4512 nonspam messages and 1594
    spams.

2.  Program formail was used to distribute the messages:

      FILENO=0 formail -s twelfths <corpus.good

      This produced a set of twelve files which were renamed thus:

      cgc1-0  cgc1-1  cgc1-2    cgc2-0  cgc2-1  cgc2-2
      cgc3-0  cgc3-1  cgc3-2    cgc4-0  cgc4-1  cgc4-2

      Similarly,

      FILENO=0 formail -s twelfths <corpus.bad

      produced files that were renamed cbc[1-4]-[0-2] analogously.
      The "twelfths" script merely contained

	#! /bin/bash
	# formail -s runs this once per message, incrementing FILENO each time
	let n=${FILENO}%12
	fname=cgx-$n
	cat >>$fname

      so the effect was to deal out the messages into the twelve
      files as if we were dealing a deck of cards.  The
      naming convention was based on b for bad and g for good, with
      the number before the dash representing the round and the
      number after the dash representing the run.

    For use in training (see below) the spam and nonspam messages
    of each round were pooled:
	for n in 1 2 3 4; do
	    cat cgc$n-? >t.good.$n
	    cat cbc$n-? >t.bad.$n
	done

3.  Three BOGODIR directories, named .bogovis, .bogorob and .bogogra,
    were each populated with the same bogofilter training set.  The
    goodlist.db file had 5129 messages and the spamlist.db file had
    4840.  The experiment's first training round noticeably   
    improved performance; as the original training set was accumulated
    on a different system from that on which the experiment was run,  
    this improvement was probably due to the messages in round 1 being
    more typical of those seen in later rounds.

4.  A script called maketable was used to build the R data frame
    mentioned in step 3 of the Introduction, and to obtain a summary
    printout.  Unless the reader wishes to repeat the experiment, the
    scripts need not be studied in detail; suffice it to say that the
    output resembles the following:

	Results for test series 25254, 509 messages including 133 spam:
	  Gra(s)  fneg  fpos  Rob(s)  fneg  fpos  Gra(u)  fneg  fpos  Rob(u)  fneg  fpos
	n    36     25    11     24     23     1     36     25    11     24     23     1
	%0.0707 18.797 2.926 0.0472 17.293 0.266 0.0707 18.797 2.926 0.0472 17.293 0.266

    The error rate appears under Gra() or Rob(); note that in the "%"
    line it is expressed as an absolute proportion, not a percentage.
    The fneg and fpos figures that follow are, in the "%" line,
    percentages of the spam and nonspam messages respectively.  Letter
    (s) indicates supervised training, while letter (u) means
    unsupervised, aka automated or automatic.  The maketable script
    follows:

	#! /bin/bash
	#  make individual vectors
	cd ~/bin
	>../test/visg.$$ ; >../test/visr.$$
	>../test/gra.$$ ; >../test/rob.$$
	>../test/isspam.$$ ; >../test/indices.$$
	indx=1; spam=Y
	for corp in cb cg; do
	    formail -s bogovis -v -g <../mail/${corp}$1 | cut -c 13 >>../test/visg.$$ &
	    formail -s bogovis -v -r <../mail/${corp}$1 | cut -c 13 >>../test/visr.$$ &
	    formail -s bogogra -v -g <../mail/${corp}$1 | cut -c 13 >>../test/gra.$$ &
	    formail -s bogorob -v -r <../mail/${corp}$1 | cut -c 13 >>../test/rob.$$ &
	    wait
	    msgcount=`wc -l ../test/visg.$$ | awk '{print $1}'`
	    for n in `seq $indx $msgcount`; do
		echo $n >>../test/indices.$$
		echo $spam >>../test/isspam.$$
	    done
	    let indx=${msgcount}+1; spam="N"
	done
	#  combine into an R data frame
	cd ../test
	echo "    spam  sG  sR  uG  uR" >table.$$
	pr -m -t -l 9999 -F indices.$$ isspam.$$ visg.$$ visr.$$ gra.$$ rob.$$ | \
	    awk '{printf("%4s%4s%4s%4s%4s%4s\n", $1, $2, $3, $4, $5, $6)}' | \
	    tr YN 01 >>table.$$
	#  expunge unneeded vector files
	/bin/rm ../test/[!t]*.$$
	#  calculate percentage correctness
	reportpercents $$

    The reportpercents script looks like this:

        #! /usr/bin/perl

        $ext = $ARGV[0];
        open(TBL, "table.$ext") || die("couldn't open table.$ext");
        <TBL>;
        $sumsupgra = $sumsuprob = $sumunsgra = $sumunsrob = $n = $ns = 0;
        $fnsupgra = $fpsupgra = $fnsuprob = $fpsuprob = 0;
        $fnunsgra = $fpunsgra = $fnunsrob = $fpunsrob = 0;
        while (<TBL>) {
            ($indx, $spam, $supgra, $suprob, $unsgra, $unsrob) = split;
            $n++; $ns++ if($spam == 0);
            $sumsupgra++ if($supgra != $spam);
            $sumsuprob++ if($suprob != $spam);
            $sumunsgra++ if($unsgra != $spam);
            $sumunsrob++ if($unsrob != $spam);
            $fnsupgra++ if($supgra > $spam);
            $fpsupgra++ if($supgra < $spam);
            $fnsuprob++ if($suprob > $spam);
            $fpsuprob++ if($suprob < $spam);
            $fnunsgra++ if($unsgra > $spam);
            $fpunsgra++ if($unsgra < $spam);
            $fnunsrob++ if($unsrob > $spam);
            $fpunsrob++ if($unsrob < $spam);
        }
        close TBL;
        printf("Results for test series $ext, %d messages including %d spam:\n",$n,$ns);
        print "  Gra(s)  fneg  fpos  Rob(s)  fneg  fpos  Gra(u)  fneg  fpos  Rob(u)  fneg  fpos\n";
        printf("n%6d%7d%6d%7d%7d%6d%7d%7d%6d%7d%7d%6d\n",
            $sumsupgra, $fnsupgra, $fpsupgra, $sumsuprob, $fnsuprob, $fpsuprob,
            $sumunsgra, $fnunsgra, $fpunsgra, $sumunsrob, $fnunsrob, $fpunsrob);
        printf("%%%6.4f%7.3f%6.3f%7.4f%7.3f%6.3f%7.4f%7.3f%6.3f%7.4f%7.3f%6.3f\n",
            $sumsupgra / $n, 100 * $fnsupgra / $ns, 100 * $fpsupgra / ($n-$ns),
            $sumsuprob / $n, 100 * $fnsuprob / $ns, 100 * $fpsuprob / ($n-$ns),
            $sumunsgra / $n, 100 * $fnunsgra / $ns, 100 * $fpunsgra / ($n-$ns),
            $sumunsrob / $n, 100 * $fnunsrob / $ns, 100 * $fpunsrob / ($n-$ns));
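
    Readers who prefer R to Perl can recover the same error rates
    directly from one of the table.NNNN files; the following sketch
    (using the file name from the sample output above; it is not part
    of the scripts actually run) illustrates this:

        tbl  <- read.table("table.25254", header = TRUE)     # first field becomes row names
        m    <- as.matrix(tbl[, c("sG", "sR", "uG", "uR")])  # verdicts: 0 = spam, 1 = nonspam
        errs <- colSums(m != tbl$spam)                       # misclassifications per classifier
        errs / nrow(tbl)                                     # error rates, as in the "%" line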

    For training within rounds, a script called train was invoked:

	#! /bin/bash
	if [ "x$1" = "x" ]; then
	    echo "Usage: train n (n is the round number)"
	    exit 1
	else
	    echo "Training on round $1:"
	fi
	echo -n "Supervised training, bad... "
	formail -s bogovis -s -r <~/mail/t.bad.$1
	echo done
	echo -n "Supervised training, good... "
	formail -s bogovis -n -r <~/mail/t.good.$1
	echo done
	echo -n "Automated training, Robinson, bad... "
	formail -s bogorob -u -r <~/mail/t.bad.$1
	echo done
	echo -n "Automated training, Robinson, good... "
	formail -s bogorob -u -r <~/mail/t.good.$1
	echo done
	echo -n "Automated training, Graham, bad... "
	formail -s bogogra -u <~/mail/t.bad.$1
	echo done
	echo -n "Automated training, Graham, good... "
	formail -s bogogra -u <~/mail/t.good.$1
	echo "done, copying... "
	for dir in ~/.bogovis ~/.bogorob ~/.bogogra; do
	    for file in goodlist.db spamlist.db; do
		cp $dir/$file $dir/$file.$1
	    done
	done
	echo done
	echo "Training complete."

    In the above, commands bogovis, bogorob and bogogra performed the
    equivalent of "bogofilter -d ~/.bogovis" and so on.

5.  For each round of the experiment, these tools were applied as follows:
    Let the number of the round be represented by X (1, 2, 3 or 4); then

	for table in cX-0 cX-1 cX-2; do maketable $table; done
	train X
	for table in cX-0 cX-1 cX-2; do maketable $table; done
	cp ~/.bogovis/*.db ~/.bogorob
	cp ~/.bogovis/*.db ~/.bogogra

    The last two commands ensure that each round starts from the
    training set produced with supervision; we did not want training
    errors produced by automated training to accumulate.  As the
    results appeared, they were entered (cut and pasted) into two
    files, one showing the reportpercents script's output, the other an
    R data frame with the error rates (the format of the R data frame
    appears in the Results section below).

6.  The data frame mentioned in the preceding paragraph served as input
    to the following R script:

        if(length(grep("^outfile$",ls(),value=TRUE)) > 0) sink(outfile)
        read.table("bogoGR.tab") -> bogo
        bogo$Spamlist <- factor(bogo$Spamlist)
        bogo$Training <- factor(bogo$Training)
        bogo$Method <- factor(bogo$Method)
        bogo$Run <- factor(bogo$Run)
        attach(bogo)
        print(bogo)
        bogaov <- aov(Error ~ Spamlist + Training + Method + Run, data=bogo)
        print(summary(bogaov))
        replaov <- aov(Error ~ Spamlist + Training + Method + Spamlist * Training
            + Spamlist * Method + Training * Method, data=bogo)
        print(summary(replaov))
        print(TukeyHSD(replaov),digits=4)
        d <- c(1.95996, 0.412, 0.423)
        rn <- length(residuals(replaov))
        rdf <- rn - rn/6 - 6
        rms <- deviance(replaov) / rdf
        n <- 3
        errors <- array(bogo$Error, dim=c(3,rn/3))
        meanerr <- apply(errors, 2, mean)
        z <- (d[1] + 1 / (rdf * d[2] - d[3])) * sqrt(rms / n)
        lcl <- pmax(0,meanerr - z)
        ucl <- meanerr + z
        MeanErr <- round(meanerr,digits=5)
        LCL <- round(lcl,digits=5)
        UCL <- round(ucl,digits=5)
        lbls <- c("G1","R1","Gs1","Rs1","Ga1","Ra1")
        if(length(lbls)<rn/3) lbls <- c(lbls,"G2","R2","Gs2","Rs2","Ga2","Ra2")
        if(length(lbls)<rn/3) lbls <- c(lbls,"G3","R3","Gs3","Rs3","Ga3","Ra3")
        if(length(lbls)<rn/3) lbls <- c(lbls,"G4","R4","Gs4","Rs4","Ga4","Ra4")
        if(length(lbls)<rn/3) lbls <- c(lbls,"G5","R5","Gs5","Rs5","Ga5","Ra5")
        data.frame(MeanErr,LCL,UCL,row.names=lbls) -> mcl
        print(mcl)
        plot(meanerr,ylim=c(0,0.1),ylab="Error Rate",xlab="",axes=FALSE)
        axis(2)
        axis(1,at=1:(rn/3),labels=lbls)
        points(ucl,pch="-")
        points(lcl,pch="-")
        lines(ucl,type="h")
        lines(lcl,type="h",col="white")
        text(rn*13/54,0.097,sprintf("Bogofilter test, rounds 1-%0.0f",rn/18))
        sink()

    This R script (called bogaov.R) was invoked like this:

	outfile <- "bogoGR.sess"
	source("bogaov.R", echo=TRUE)

7.  The plot produced by bogaov.R was saved:

	import bogoGR.png
	    (click on the plot with the mouse)

Results:
"""""""

1.  Here is the output of the reportpercents runs:

Combined corpora: bogofilter test


Round 1

Before training:

$ for table in c1-0 c1-1 c1-2; do maketable $table; done
Results for test series 25254, 509 messages including 133 spam:
  Gra(s)  fneg  fpos  Rob(s)  fneg  fpos  Gra(u)  fneg  fpos  Rob(u)  fneg  fpos
n    36     25    11     24     23     1     36     25    11     24     23     1
%0.0707 18.797 2.926 0.0472 17.293 0.266 0.0707 18.797 2.926 0.0472 17.293 0.266
Results for test series 27329, 509 messages including 133 spam:
  Gra(s)  fneg  fpos  Rob(s)  fneg  fpos  Gra(u)  fneg  fpos  Rob(u)  fneg  fpos
n    37     25    12     25     24     1     37     25    12     25     24     1
%0.0727 18.797 3.191 0.0491 18.045 0.266 0.0727 18.797 3.191 0.0491 18.045 0.266
Results for test series 29411, 509 messages including 133 spam:
  Gra(s)  fneg  fpos  Rob(s)  fneg  fpos  Gra(u)  fneg  fpos  Rob(u)  fneg  fpos
n    37     29     8     32     31     1     37     29     8     32     31     1
%0.0727 21.805 2.128 0.0629 23.308 0.266 0.0727 21.805 2.128 0.0629 23.308 0.266

After training:

$ for table in c1-0 c1-1 c1-2; do maketable $table; done
Results for test series 3892, 509 messages including 133 spam:
  Gra(s)  fneg  fpos  Rob(s)  fneg  fpos  Gra(u)  fneg  fpos  Rob(u)  fneg  fpos
n    21     21     0      1      1     0     29     18    11     15     13     2
%0.0413 15.789 0.000 0.0020  0.752 0.000 0.0570 13.534 2.926 0.0295  9.774 0.532
Results for test series 5961, 509 messages including 133 spam:
  Gra(s)  fneg  fpos  Rob(s)  fneg  fpos  Gra(u)  fneg  fpos  Rob(u)  fneg  fpos
n    20     20     0      3      3     0     35     22    13     21     18     3
%0.0393 15.038 0.000 0.0059  2.256 0.000 0.0688 16.541 3.457 0.0413 13.534 0.798
Results for test series 8030, 509 messages including 133 spam:
  Gra(s)  fneg  fpos  Rob(s)  fneg  fpos  Gra(u)  fneg  fpos  Rob(u)  fneg  fpos
n    29     29     0      1      1     0     33     25     8     18     16     2
%0.0570 21.805 0.000 0.0020  0.752 0.000 0.0648 18.797 2.128 0.0354 12.030 0.532


Round 2

Before training:

$ for table in c2-0 c2-1 c2-2; do maketable $table; done
Results for test series 16426, 509 messages including 133 spam:
  Gra(s)  fneg  fpos  Rob(s)  fneg  fpos  Gra(u)  fneg  fpos  Rob(u)  fneg  fpos
n    27     24     3     15     14     1     27     24     3     15     14     1
%0.0530 18.045 0.798 0.0295 10.526 0.266 0.0530 18.045 0.798 0.0295 10.526 0.266
Results for test series 18494, 509 messages including 133 spam:
  Gra(s)  fneg  fpos  Rob(s)  fneg  fpos  Gra(u)  fneg  fpos  Rob(u)  fneg  fpos
n    22     20     2     13     12     1     22     20     2     13     12     1
%0.0432 15.038 0.532 0.0255  9.023 0.266 0.0432 15.038 0.532 0.0255  9.023 0.266
Results for test series 20563, 509 messages including 133 spam:
  Gra(s)  fneg  fpos  Rob(s)  fneg  fpos  Gra(u)  fneg  fpos  Rob(u)  fneg  fpos
n    27     22     5     22     21     1     27     22     5     22     21     1
%0.0530 16.541 1.330 0.0432 15.789 0.266 0.0530 16.541 1.330 0.0432 15.789 0.266

After training:

Results for test series 27432, 509 messages including 133 spam:
  Gra(s)  fneg  fpos  Rob(s)  fneg  fpos  Gra(u)  fneg  fpos  Rob(u)  fneg  fpos
n    20     20     0      2      2     0     27     24     3     12      9     3
%0.0393 15.038 0.000 0.0039  1.504 0.000 0.0530 18.045 0.798 0.0236  6.767 0.798
Results for test series 29501, 509 messages including 133 spam:
  Gra(s)  fneg  fpos  Rob(s)  fneg  fpos  Gra(u)  fneg  fpos  Rob(u)  fneg  fpos
n    13     13     0      1      1     0     21     18     3     12     11     1
%0.0255  9.774 0.000 0.0020  0.752 0.000 0.0413 13.534 0.798 0.0236  8.271 0.266
Results for test series 31569, 509 messages including 133 spam:
  Gra(s)  fneg  fpos  Rob(s)  fneg  fpos  Gra(u)  fneg  fpos  Rob(u)  fneg  fpos
n    17     17     0      3      3     0     23     18     5     21     17     4
%0.0334 12.782 0.000 0.0059  2.256 0.000 0.0452 13.534 1.330 0.0413 12.782 1.064


Round 3

Before training:

$ for table in c3-0 c3-1 c3-2; do maketable $table; done
Results for test series 1192, 509 messages including 133 spam:
  Gra(s)  fneg  fpos  Rob(s)  fneg  fpos  Gra(u)  fneg  fpos  Rob(u)  fneg  fpos
n    26     24     2     15     14     1     26     24     2     15     14     1
%0.0511 18.045 0.532 0.0295 10.526 0.266 0.0511 18.045 0.532 0.0295 10.526 0.266
Results for test series 3261, 509 messages including 133 spam:
  Gra(s)  fneg  fpos  Rob(s)  fneg  fpos  Gra(u)  fneg  fpos  Rob(u)  fneg  fpos
n    34     31     3     24     24     0     34     31     3     24     24     0
%0.0668 23.308 0.798 0.0472 18.045 0.000 0.0668 23.308 0.798 0.0472 18.045 0.000
Results for test series 5330, 509 messages including 133 spam:
  Gra(s)  fneg  fpos  Rob(s)  fneg  fpos  Gra(u)  fneg  fpos  Rob(u)  fneg  fpos
n    30     27     3     24     22     2     30     27     3     24     22     2
%0.0589 20.301 0.798 0.0472 16.541 0.532 0.0589 20.301 0.798 0.0472 16.541 0.532

After training:

$ for table in c3-0 c3-1 c3-2; do maketable $table; done
Results for test series 12183, 509 messages including 133 spam:
  Gra(s)  fneg  fpos  Rob(s)  fneg  fpos  Gra(u)  fneg  fpos  Rob(u)  fneg  fpos
n    16     16     0      1      1     0     24     22     2     13     12     1
%0.0314 12.030 0.000 0.0020  0.752 0.000 0.0472 16.541 0.532 0.0255  9.023 0.266
Results for test series 14264, 509 messages including 133 spam:
  Gra(s)  fneg  fpos  Rob(s)  fneg  fpos  Gra(u)  fneg  fpos  Rob(u)  fneg  fpos
n    22     22     0      2      2     0     31     28     3     22     21     1
%0.0432 16.541 0.000 0.0039  1.504 0.000 0.0609 21.053 0.798 0.0432 15.789 0.266
Results for test series 16333, 509 messages including 133 spam:
  Gra(s)  fneg  fpos  Rob(s)  fneg  fpos  Gra(u)  fneg  fpos  Rob(u)  fneg  fpos
n    24     24     0      5      5     0     27     24     3     23     20     3
%0.0472 18.045 0.000 0.0098  3.759 0.000 0.0530 18.045 0.798 0.0452 15.038 0.798


Round 4

Before training:

$ for table in c4-0 c4-1 c4-2; do maketable $table; done
Results for test series 18451, 509 messages including 133 spam:
  Gra(s)  fneg  fpos  Rob(s)  fneg  fpos  Gra(u)  fneg  fpos  Rob(u)  fneg  fpos
n    29     25     4     18     16     2     29     25     4     18     16     2
%0.0570 18.797 1.064 0.0354 12.030 0.532 0.0570 18.797 1.064 0.0354 12.030 0.532
Results for test series 20522, 508 messages including 132 spam:
  Gra(s)  fneg  fpos  Rob(s)  fneg  fpos  Gra(u)  fneg  fpos  Rob(u)  fneg  fpos
n    26     21     5     19     17     2     26     21     5     19     17     2
%0.0512 15.909 1.330 0.0374 12.879 0.532 0.0512 15.909 1.330 0.0374 12.879 0.532
Results for test series 22587, 508 messages including 132 spam:
  Gra(s)  fneg  fpos  Rob(s)  fneg  fpos  Gra(u)  fneg  fpos  Rob(u)  fneg  fpos
n    27     23     4     21     19     2     27     23     4     21     19     2
%0.0531 17.424 1.064 0.0413 14.394 0.532 0.0531 17.424 1.064 0.0413 14.394 0.532

After training:

$ for table in c4-0 c4-1 c4-2; do maketable $table; done
Results for test series 29488, 509 messages including 133 spam:
  Gra(s)  fneg  fpos  Rob(s)  fneg  fpos  Gra(u)  fneg  fpos  Rob(u)  fneg  fpos
n    20     20     0      6      5     1     25     21     4     16     14     2
%0.0393 15.038 0.000 0.0118  3.759 0.266 0.0491 15.789 1.064 0.0314 10.526 0.532
Results for test series 31574, 508 messages including 132 spam:
  Gra(s)  fneg  fpos  Rob(s)  fneg  fpos  Gra(u)  fneg  fpos  Rob(u)  fneg  fpos
n    17     16     1      4      3     1     25     20     5     18     15     3
%0.0335 12.121 0.266 0.0079  2.273 0.266 0.0492 15.152 1.330 0.0354 11.364 0.798
Results for test series 1172, 508 messages including 132 spam:
  Gra(s)  fneg  fpos  Rob(s)  fneg  fpos  Gra(u)  fneg  fpos  Rob(u)  fneg  fpos
n    20     19     1      5      4     1     25     21     4     22     19     3
%0.0394 14.394 0.266 0.0098  3.030 0.266 0.0492 15.909 1.064 0.0433 14.394 0.798

2.  Here (slightly abbreviated) are the contents of the R session:

> bogo <- read.table("bogoGR.tab")
> bogo$Spamlist <- factor(bogo$Spamlist)
> bogo$Training <- factor(bogo$Training)
> bogo$Method <- factor(bogo$Method)
> bogo$Run <- factor(bogo$Run)
> attach(bogo)
> print(bogo)
   Spamlist   Training   Method Run  Error     Spamlist   Training   Method Run  Error
1      4840        Pre   Graham   1 0.0707  37     5638        Pre   Graham   1 0.0511
2      4840        Pre   Graham   2 0.0727  38     5638        Pre   Graham   2 0.0668
3      4840        Pre   Graham   3 0.0727  39     5638        Pre   Graham   3 0.0589
4      4840        Pre Robinson   1 0.0472  40     5638        Pre Robinson   1 0.0295
5      4840        Pre Robinson   2 0.0491  41     5638        Pre Robinson   2 0.0472
6      4840        Pre Robinson   3 0.0629  42     5638        Pre Robinson   3 0.0472
7      4840 Supervised   Graham   1 0.0413  43     5638 Supervised   Graham   1 0.0314
8      4840 Supervised   Graham   2 0.0393  44     5638 Supervised   Graham   2 0.0432
9      4840 Supervised   Graham   3 0.0570  45     5638 Supervised   Graham   3 0.0472
10     4840 Supervised Robinson   1 0.0020  46     5638 Supervised Robinson   1 0.0020
11     4840 Supervised Robinson   2 0.0059  47     5638 Supervised Robinson   2 0.0039
12     4840 Supervised Robinson   3 0.0020  48     5638 Supervised Robinson   3 0.0098
13     4840  Automated   Graham   1 0.0570  49     5638  Automated   Graham   1 0.0472
14     4840  Automated   Graham   2 0.0688  50     5638  Automated   Graham   2 0.0609
15     4840  Automated   Graham   3 0.0648  51     5638  Automated   Graham   3 0.0530
16     4840  Automated Robinson   1 0.0295  52     5638  Automated Robinson   1 0.0255
17     4840  Automated Robinson   2 0.0413  53     5638  Automated Robinson   2 0.0432
18     4840  Automated Robinson   3 0.0354  54     5638  Automated Robinson   3 0.0452
19     5239        Pre   Graham   1 0.0530  55     6037        Pre   Graham   1 0.0570
20     5239        Pre   Graham   2 0.0432  56     6037        Pre   Graham   2 0.0512
21     5239        Pre   Graham   3 0.0530  57     6037        Pre   Graham   3 0.0531
22     5239        Pre Robinson   1 0.0295  58     6037        Pre Robinson   1 0.0354
23     5239        Pre Robinson   2 0.0255  59     6037        Pre Robinson   2 0.0374
24     5239        Pre Robinson   3 0.0432  60     6037        Pre Robinson   3 0.0413
25     5239 Supervised   Graham   1 0.0393  61     6037 Supervised   Graham   1 0.0393
26     5239 Supervised   Graham   2 0.0255  62     6037 Supervised   Graham   2 0.0335
27     5239 Supervised   Graham   3 0.0334  63     6037 Supervised   Graham   3 0.0394
28     5239 Supervised Robinson   1 0.0039  64     6037 Supervised Robinson   1 0.0118
29     5239 Supervised Robinson   2 0.0020  65     6037 Supervised Robinson   2 0.0079
30     5239 Supervised Robinson   3 0.0059  66     6037 Supervised Robinson   3 0.0098
31     5239  Automated   Graham   1 0.0530  67     6037  Automated   Graham   1 0.0491
32     5239  Automated   Graham   2 0.0413  68     6037  Automated   Graham   2 0.0492
33     5239  Automated   Graham   3 0.0452  69     6037  Automated   Graham   3 0.0492
34     5239  Automated Robinson   1 0.0296  70     6037  Automated Robinson   1 0.0314
35     5239  Automated Robinson   2 0.0236  71     6037  Automated Robinson   2 0.0354
36     5239  Automated Robinson   3 0.0413  72     6037  Automated Robinson   3 0.0433

> bogaov <- aov(Error ~ Spamlist + Training + Method + 
    Run, data = bogo)
> print(summary(bogaov))
            Df    Sum Sq   Mean Sq  F value    Pr(>F)    
Spamlist     3 0.0014951 0.0004984   8.8808 5.412e-05 ***
Training     2 0.0101961 0.0050981  90.8453 < 2.2e-16 ***
Method       1 0.0094508 0.0094508 168.4094 < 2.2e-16 ***
Run          2 0.0004673 0.0002336   4.1631   0.02004 *  
Residuals   63 0.0035354 0.0000561                       
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 

> replaov <- aov(Error ~ Spamlist + Training + Method + 
    Spamlist * Training + Spamlist * Method + Training * Method, 
    data = bogo)

> print(summary(replaov))
                  Df    Sum Sq   Mean Sq  F value    Pr(>F)    
Spamlist           3 0.0014951 0.0004984  12.5747 2.396e-06 ***
Training           2 0.0101961 0.0050981 128.6322 < 2.2e-16 ***
Method             1 0.0094508 0.0094508 238.4589 < 2.2e-16 ***
Spamlist:Training  6 0.0005042 0.0000840   2.1203   0.06577 .  
Spamlist:Method    3 0.0003346 0.0001115   2.8145   0.04778 *  
Training:Method    2 0.0010237 0.0005118  12.9145 2.608e-05 ***
Residuals         54 0.0021402 0.0000396                       
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 

> print(TukeyHSD(replaov), digits = 4)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = Error ~ Spamlist + Training + Method + Spamlist *
     Training + Spamlist * Method + Training * Method, data = bogo)

$Spamlist
               diff       lwr        upr
5239-4840 -0.012678 -0.018241 -0.0071149
5638-4840 -0.005911 -0.011474 -0.0003483
6037-4840 -0.008050 -0.013613 -0.0024872
5638-5239  0.006767  0.001204  0.0123295
6037-5239  0.004628 -0.000935  0.0101906
6037-5638 -0.002139 -0.007702  0.0034239

$Training
                          diff       lwr      upr
Pre-Automated         0.005642  0.001262  0.01002
Supervised-Automated -0.021946 -0.026326 -0.01757
Supervised-Pre       -0.027587 -0.031967 -0.02321

$Method
                    diff      lwr      upr
Robinson-Graham -0.02291 -0.02589 -0.01994

(Interaction printout elided)

> d <- c(1.95996, 0.412, 0.423)
> rn <- length(residuals(replaov))
> rdf <- rn - rn/6 - 6
> rms <- deviance(replaov)/rdf
> n <- 3

> errors <- array(bogo$Error, dim = c(3, rn/3))
> meanerr <- apply(errors, 2, mean)
> z <- (d[1] + 1/(rdf * d[2] - d[3])) * sqrt(rms/n)
> lcl <- pmax(0, meanerr - z)
> ucl <- meanerr + z

> MeanErr <- round(meanerr, digits = 5)
> LCL <- round(lcl, digits = 5)
> UCL <- round(ucl, digits = 5)

> lbls <- c("G1", "R1", "Gs1", "Rs1", "Ga1", "Ra1")
> if (length(lbls) < rn/3) lbls <- c(lbls, "G2", "R2", 
    "Gs2", "Rs2", "Ga2", "Ra2")
> if (length(lbls) < rn/3) lbls <- c(lbls, "G3", "R3", 
    "Gs3", "Rs3", "Ga3", "Ra3")
> if (length(lbls) < rn/3) lbls <- c(lbls, "G4", "R4", 
    "Gs4", "Rs4", "Ga4", "Ra4")
> if (length(lbls) < rn/3) lbls <- c(lbls, "G5", "R5", 
    "Gs5", "Rs5", "Ga5", "Ra5")
> mcl <- data.frame(MeanErr, LCL, UCL, row.names = lbls)
> print(mcl)
    MeanErr     LCL     UCL
G1  0.07203 0.06474 0.07932
R1  0.05307 0.04578 0.06036
Gs1 0.04587 0.03858 0.05316
Rs1 0.00330 0.00000 0.01059
Ga1 0.06353 0.05624 0.07082
Ra1 0.03540 0.02811 0.04269
G2  0.04973 0.04244 0.05702
R2  0.03273 0.02544 0.04002
Gs2 0.03273 0.02544 0.04002
Rs2 0.00393 0.00000 0.01122
Ga2 0.04650 0.03921 0.05379
Ra2 0.03150 0.02421 0.03879
G3  0.05893 0.05164 0.06622
R3  0.04130 0.03401 0.04859
Gs3 0.04060 0.03331 0.04789
Rs3 0.00523 0.00000 0.01252
Ga3 0.05370 0.04641 0.06099
Ra3 0.03797 0.03068 0.04526
G4  0.05377 0.04648 0.06106
R4  0.03803 0.03074 0.04532
Gs4 0.03740 0.03011 0.04469
Rs4 0.00983 0.00254 0.01712
Ga4 0.04917 0.04188 0.05646
Ra4 0.03670 0.02941 0.04399

> plot(meanerr, ylim = c(0, 0.08), ylab = "Error Rate", 
    xlab = "", axes = FALSE)
> axis(2)
> axis(1, at = 1:(rn/3), labels = lbls)
> points(ucl, pch = "-")
> points(lcl, pch = "-")
> lines(ucl, type = "h")
> lines(lcl, type = "h", col = "white")
> text(rn * 13/54, 0.078, sprintf("Bogofilter test, rounds 1-%0.0f", 
    rn/18))

> sink()

The accompanying file bogoGR.png contains a plot of the results.  It
shows mean and 95% confidence limits of the error rate for each method
(G and R), before training and after supervised or automated training
(s and a), for each round (1-4).

Discussion:
""""""""""

The effect of factor Run (i.e., the difference caused by whether it was
the first, second or third group of messages being processed) was small
enough to be dismissed; the runs were treated as simple replicates.

The observations that follow are apparent from the graph and are all
confirmed by the analysis:

1.  Robinson's method of calculation had a generally lower error rate
    than did Graham's.  (Compare Gx with Rx where x is 1, 2, 3 or 4.)
    Applying Robinson's calculation method, in this experiment,
    produced a diminution of between 0.0199 and 0.0259 in the overall
    error rate (those are 95% confidence limits; the mean was 0.0229)
    with respect to that achieved by Graham's method.  Since Graham's
    error rate averaged 0.0586 in this experiment, that means that the
    Robinson method performed between 34% and 44% better than did the
    Graham method (see the check following point 4 below).  This
    effect is highly significant, both
    statistically and practically.

2.  The effect of spamlist size (i.e., of training in general) was highly
    significant; the second, third and fourth rounds all showed lower
    error rates than the first.  After the first round, however, the
    training effect seen from round to round was small (within the
    level of random variation).  This is unsurprising, as the first
    round's training served to render the training database
    significantly more typical of the population of messages to be
    evaluated in subsequent rounds, and the training database was only
    growing by about 8% per round (also checked after point 4).  A
    general tendency to improve with
    further training might exist, but this experiment was too brief to
    demonstrate it.  There seemed to be little interaction between
    spamlist size and either of the other factors.

3.  When the data for a round were added to the training set and then
    re-evaluated, the effect of type of training (automated or
    supervised) was also highly significant; as expected, supervised
    training had a much better effect on performance than automated
    training had.  Letting bogofilter feed its own decisions into its
    training set seems not to be an effective way to train, unless the
    -S/-N options are used to correct its errors.  (Personally, I
    prefer to have bogofilter put what it thinks is spam in a separate
    folder, which I periodically review.  To that folder I also
    transfer any false negatives that appear among the regular email. 
    Then the regular archive and the spam folder are added with -n and
    -s so the training set never sees bad data -- well, hardly ever ;-)

    The improvement in bogofilter's performance when the training data
    were re-evaluated after supervised training was particularly
    dramatic with Robinson's method of calculation, which takes into
    account all the tokens in the message rather than just the most
    characteristic ones.  (Compare Gx with Gsx and Rx with Rsx where x
    is 1, 2, 3 or 4.)

    This significant training effect means that re-evaluating training
    data should not be used to test bogofilter's performance, since
    such tests will have no value in predicting performance in
    production, where most messages are new.

4.  Although generally beneficial (when errors are not allowed to
    accumulate), automatic training is inferior to supervised training
    in reducing the error rate when training data are re-evaluated.
    This too is unsurprising, as automatic training means that any
    errors made prior to training are entered into the training set
    and reinforce the likelihood of making the same error again.
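
As a quick check of the arithmetic quoted in points 1 and 2, using only
figures already given above:

        # Point 1: Robinson's improvement relative to Graham's average
        # error rate of 0.0586 (Tukey lower limit, mean, upper limit).
        round(100 * c(0.0199, 0.0229, 0.0259) / 0.0586)    # 34 39 44 percent

        # Point 2: growth of the spamlist from round to round.
        sizes <- c(4840, 5239, 5638, 6037)
        round(100 * diff(sizes) / head(sizes, -1), 1)      # 8.2 7.6 7.1 percent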

Conclusions:
"""""""""""
We can respond to the questions posed in the introduction as follows:

1.  The results of this experiment indicate that Robinson's
    method of calculation is more likely to yield correct results than
    is the original method proposed by Graham.

2.  If a test set of messages is added to the training set and then
    re-evaluated, a significant learning effect is seen.  This effect
    is greater if the test set is manually corrected before addition.

[(C) Greg Louis, 2002; last modified 2002-11-03]
-------------- next part --------------
A non-text attachment was scrubbed...
Name: bogoGR.png
Type: image/png
Size: 6603 bytes
Desc: not available
URL: <http://www.bogofilter.org/pipermail/bogofilter/attachments/20021103/4ceede64/attachment.png>

