Graham's method seemed better (was: man bogoutil)

Greg Louis glouis at dynamicro.on.ca
Sat Nov 23 03:19:20 CET 2002


On 20021122 (Fri) at 0938:17 +0100, Boris 'pi' Piwinger wrote:
> Greg Louis <glouis at dynamicro.on.ca> wrote:
> 
> >> Actually, I just today removed the -r which gave pretty poor
> >> result for me (missing a lot). I don't have statistics on
> >> that, though.
> >
> >Surprising.  If you have a reasonably large training set and could
> >_get_ stats, it would be good to see them, 'coz Robinson's approach is
> >working quite a bit better than the original for most people who've
> >compared them.  
> 
> OK, sorry, I don't have time to work out statistics. I am
> willing to share my training sets (continually growing),
> though. Since I changed, the number of spam missed dropped
> significantly. So my feeling proved right for me.
 
Well, I now have results from a look at the emails you allowed me to
download.

You provided me with two files containing known good email and three
files containing known spam.

I took one-third of these for training, and split the remaining
messages into three files of equal size, like this:

  $ cat ham* >corpus.good; cat spam* >corpus.bad
  $ cat distrib
   #! /bin/bash
   # usage: FILENO=0 formail -s ./distrib [bad|good] <corpus.[bad|good]
   # round-robin split: of every 9 messages, 3 go to training (t.*)
   # and 2 each to the test-run files r0.*, r1.* and r2.*
   FILE=(t.$1 r1.$1 r2.$1 t.$1 r0.$1 r1.$1 t.$1 r2.$1 r0.$1)
   let n=${FILENO}%9
   fname=${FILE[$n]}
   cat >>$fname

  $ FILENO=0 formail -s ./distrib bad <corpus.bad
  $ FILENO=0 formail -s ./distrib good <corpus.good

Then I performed the training, thus:

  $ mkdir rob
  $ ~/bin/bogovis -d rob -r -s </store/safe/pi_mail/t.bad 
  $ ~/bin/bogovis -d rob -r -n </store/safe/pi_mail/t.good 
  $ mkdir gra
  $ ~/bin/bogovis -d gra -g -s </store/safe/pi_mail/t.bad  
  $ ~/bin/bogovis -d gra -g -n </store/safe/pi_mail/t.good 

bogovis is a bogofilter binary that supports Graham's method and the
original Robinson method.  /usr/bin/bogofilter, on this machine,
supports only the Fisher-based variant of the Robinson method.

$ bogo=~/bin/bogovis
$ bogu=/usr/bin/bogofilter
$ for run in 0 1 2 ; do
>  formail -s $bogo -d gra -g -v </store/safe/pi_mail/r$run.bad &> sp-$run.gra
>  formail -s $bogo -d rob -r -v </store/safe/pi_mail/r$run.bad &> sp-$run.rob
>  formail -s $bogu -d rob -v </store/safe/pi_mail/r$run.bad &> sp-$run.fis
>  formail -s $bogo -d gra -g -v </store/safe/pi_mail/r$run.good &> ns-$run.gra
>  formail -s $bogo -d rob -r -v </store/safe/pi_mail/r$run.good &> ns-$run.rob
>  formail -s $bogu -d rob -v </store/safe/pi_mail/r$run.good &> ns-$run.fis
> done

There were then some manipulations to cut just the spamicities from the
outputs of these runs.  It's not essential to show them here, but if
you're interested, I can provide the details.

The rest of the work was done in R; I'm appending the actual command
sequence, but here's a summary:

The Robinson method, with spam_cutoff arbitrarily set to 0.55, gave an
average of 2.33 false positives over its three runs.  For the Graham
and Fisher methods I therefore calculated cutoffs that would give
ceiling(2.33), i.e. 3, false positives, keeping the 0.55 cutoff for the
Robinson runs.
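
To make the cutoff step concrete, here is a toy R sketch (mine, with
made-up scores, not the actual data): the cutoff is read off the sorted
nonspam spamicities and then applied to the spam scores, as in the
appendix below.

  # invented example scores; the real runs used the spamicities
  # extracted from the bogofilter output files
  ns.scores <- c(0.12, 0.31, 0.58, 0.44, 0.71, 0.05)  # known-good mail
  sp.scores <- c(0.92, 0.49, 0.88, 0.66, 0.30, 0.97)  # known spam
  k <- 3                                               # target false-positive count
  cutoff <- sort(ns.scores, decreasing=TRUE)[k]        # k-th largest nonspam score
  fn <- sum(sp.scores <= cutoff)                       # spams at or below the cutoff
  c(cutoff=cutoff, fn=fn)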

Next I applied the cutoffs to each method's three test runs, obtaining
for each the count of false negatives.  This gave me nine values, three
for each method:

  calc run  fn
1  gra   0  96
2  gra   1 107
3  gra   2 123
4  rob   0  78
5  rob   1  81
6  rob   2  97
7  fis   0  70
8  fis   1  82
9  fis   2  96

An analysis of variance showed that the difference between the Graham
results and those from the other two methods was not quite
statistically significant; that was not surprising, given that we had
only 4700-odd spams to work with and I could only train with a third of
them.  However, I used the anova results to calculate the 95%
confidence limits of the means for the three methods:

  calc  meanfn  lcl95   ucl95
1  gra  108.67  91.21  126.13
2  rob   85.33  67.87  102.79
3  fis   82.67  65.21  100.13

The attached .png file graphs these results: the horizontal lines are
the mean false-negative counts, and the ends of the vertical lines mark
the 95% confidence limits.
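
For anyone wanting to check those limits without the shortcut constants
used in the appendix, they can be reproduced (to within rounding) from
the false-negative table alone, using R's t quantile directly; this is
my own sketch, not part of the original analysis:

  # rebuild the false-negative table shown above
  res <- data.frame(calc=rep(c("gra","rob","fis"), each=3),
                    run=rep(0:2, 3),
                    fn=c(96,107,123, 78,81,97, 70,82,96))
  a    <- aov(fn ~ calc, data=res)
  rms  <- deviance(a) / a$df.residual               # residual mean square, 6 df
  half <- qt(0.975, a$df.residual) * sqrt(rms / 3)  # 95% half-width, 3 runs per method
  m    <- tapply(res$fn, res$calc, mean)            # rows come out fis, gra, rob
  round(cbind(meanfn=m, lcl95=m - half, ucl95=m + half), 2)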

We can conclude that it is unlikely to be true that Graham's
calculation method misses fewer spams than Robinson's, or than the
variant of Robinson's that is based on Fisher's method of combining
probabilities.  I strongly suspect that, instead of abandoning the
Robinson method and reverting to Graham's calculation, it would have
been wise to correct the tuning of Robinson's method (adjust the values
of spam_cutoff, x and s) to give optimal results; these would, I
believe, be superior to what Graham's original calculation method can
deliver.  Even the present experiment strongly supports the conclusion
that they would be no worse.

Appendix: detailed procedure in R:

# get vectors of spam and nonspam data
read.table("F/sp.dat") -> Ssp
read.table("F/ns.dat") -> Sns
# make them into data frames with calc and run info
data.frame(
  calc=c(rep("gra",8352),rep("rob",8352),rep("fis",8352)),
  run=c(rep(c(rep(0,2784),rep(1,2784),rep(2,2784)),3)),
  S=Sns$V1) -> ns
write.table(ns,"F/ns.tbl") 
data.frame(
  calc=c(rep("gra",1371),rep("rob",1371),rep("fis",1371)),
  run=c(rep(c(rep(0,457), rep(1,457),rep(2,457)),3)),
  S=Ssp$V1) -> sp
write.table(sp,"F/sp.tbl")
# calculate the spam_cutoff values to give constant false-positive
# counts
stdcutoff <- 0.55
fp0 <- sum(ns$S[8353:11136] > stdcutoff)
fp1 <- sum(ns$S[11137:13920] > stdcutoff)
fp2 <- sum(ns$S[13921:16704] > stdcutoff)
fp <- ceiling(mean(c(fp0,fp1,fp2)))
ts <- sort(ns$S[1:2784],decreasing=TRUE)
fp0 <- ts[fp]
ts <- sort(ns$S[2785:5568],decreasing=TRUE)
fp1 <- ts[fp]
ts <- sort(ns$S[5569:8352],decreasing=TRUE)
fp2 <- ts[fp]
gcutoff <- min(c(fp0,fp1,fp2))
ts <- sort(ns$S[16705:19488],decreasing=TRUE)
fp0 <- ts[fp]
ts <- sort(ns$S[19489:22272],decreasing=TRUE)
fp1 <- ts[fp]
ts <- sort(ns$S[22273:25056],decreasing=TRUE)
fp2 <- ts[fp]
fcutoff <- mean(c(fp0,fp1,fp2))
# obtain false-negative counts for these cutoffs
fn0 <- sum(sp$S[1:457] <= gcutoff)
fn1 <- sum(sp$S[458:914] <= gcutoff)
fn2 <- sum(sp$S[915:1371] <= gcutoff)
c(fn0,fn1,fn2) -> fng
fn0 <- sum(sp$S[1372:1828] <= stdcutoff)
fn1 <- sum(sp$S[1829:2285] <= stdcutoff)
fn2 <- sum(sp$S[2286:2742] <= stdcutoff)
c(fn0,fn1,fn2) -> fnr
fn0 <- sum(sp$S[2743:3199] <= fcutoff)
fn1 <- sum(sp$S[3200:3656] <= fcutoff)
fn2 <- sum(sp$S[3657:4113] <= fcutoff)
c(fn0,fn1,fn2) -> fnf
# make a data frame with the false-negative counts
data.frame(calc=c(rep("gra",3),rep("rob",3),rep("fis",3)),
run=c(0,1,2,0,1,2,0,1,2),
fn=c(fng,fnr,fnf)) -> res
write.table(res,"F/pi.results")
# perform analysis of variance and calculate means and c.l.
piaov <- aov(fn ~ calc, data=res)
rdf <- 6
rms <- deviance(piaov) / rdf
d <- c(1.95996, 0.412, 0.423)
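# (d[1] + 1/(rdf*d[2] - d[3]) approximates qt(0.975, rdf), the
# two-sided 95% t critical value; with rdf = 6 both come to about 2.45)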
z <- (d[1] + 1 / (rdf * d[2] - d[3])) * sqrt(rms/3)
meanfn <- round(apply(array(res$fn,dim=c(3,3)), 2, mean), digits=2)
lcl <- meanfn - z
ucl <- meanfn + z
# write a data frame with the means and c.l.
data.frame(calc=c("gra","rob","fis"), meanfn=meanfn,
lcl95=round(lcl,digits=2), ucl95=round(ucl,digits=2)) -> pi.outcome
write.table(pi.outcome,"pi.result")
# produce a plot of the means and c.l.
X11(width=5,height=5)
plot(pi.outcome$calc, pi.outcome$meanfn, main="Boris's emails",
xlab="Calculation method", ylab = "False negatives", ylim=c(60,130))
lines(pi.outcome$calc, pi.outcome$ucl95, type="h")
lines(pi.outcome$calc, pi.outcome$lcl95, type="h", col="white")

-- 
| G r e g  L o u i s          | gpg public key:      |
|   http://www.bgl.nu/~glouis |   finger greg at bgl.nu |
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pi_result.png
Type: image/png
Size: 5445 bytes
Desc: not available
URL: <http://www.bogofilter.org/pipermail/bogofilter/attachments/20021122/14927c61/attachment.png>

