evaluating possible new options

Wed May 14 23:09:46 CEST 2003

In an article titled "Better Bayesian Filtering"
(http://paulgraham.com/better.html), Paul Graham suggests:
- preserving case
- allowing exclamation points to be constituent characters in a token
- processing periods and commas so that IP addresses and dollar amounts
  are handled as units
- flagging tokens in To: From: Subject: and Return-Path: headers
- processing the contents of HTML A, IMG and FONT tags

Bogofilter already did some of that, at least optionally.  In an
experimental version, David has implemented the second and third of the
above suggestions fully, and has provided options -Pf, -Ph and -Pt that
implement the first, fourth and fifth.  He and I have been performing
experiments to evaluate the contribution these three options might make
to discrimination.  Here I report preliminary findings.

The experiment reported here tested whether any of the new -P options
in bogofilter improves discrimination between spam and nonspam.  The
options are: -Pf, toggle case folding; -Ph, toggle header tagging; and
-Pt, toggle processing of the contents of certain HTML tags (A, IMG and
FONT).
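
To make the three toggles concrete, here is a minimal Python sketch of
a tokenizer with the same three switches.  It is not bogofilter's
lexer: the token pattern, the "name:token" tagging form, the function
name tokenize and the choice to drop tag interiors when HTML processing
is off are all assumptions made for illustration.

```python
import re

# Hypothetical token pattern (not bogofilter's lexer): runs of letters,
# digits, "$", "!" and "-", joined by "." or "," only when more token
# characters follow, so "192.168.0.1" and "$1,000.00" stay whole.
TOKEN = re.compile(r"[A-Za-z0-9$!-]+(?:[.,][A-Za-z0-9$!-]+)*")
# Matches one HTML tag; group 1 is the tag name, group 2 its attributes.
TAG = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)([^>]*)>")


def tokenize(text, fold_case=True, header=None, html_tags=False):
    """Return the tokens of text under the three toggles."""
    def replace_tag(match):
        # Keep the attribute text of A, IMG and FONT tags when html_tags
        # is on; every other tag interior is dropped (an assumption).
        if html_tags and match.group(1).lower() in ("a", "img", "font"):
            return " " + match.group(2) + " "
        return " "

    tokens = TOKEN.findall(TAG.sub(replace_tag, text))
    if fold_case:
        tokens = [t.lower() for t in tokens]
    if header is not None:
        # The "name:token" form is an illustrative tagging convention.
        tokens = [header + ":" + t for t in tokens]
    return tokens


print(tokenize("FREE!!! Send $1,000.00 to 192.168.0.1.", fold_case=False))
print(tokenize("Act NOW", header="subj"))
print(tokenize('<a href="http://x.example/buy">Click</a>', html_tags=True))
```

With folding off, "FREE!!!" and "free!!!" are different tokens; with
tagging on, a word in the Subject: line is counted separately from the
same word in the body.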

The files used, with the number of messages in each, were:
grep -c '^From ' t.?? r[01].??
t.ns:6697
t.sp:9815
r0.ns:3349
r0.sp:4732
r1.ns:3348
r1.sp:4730

The .ns files contain nonspam messages and the .sp files contain spam.
The t files were used for training; r0 and r1 were the two test sets.

The following script was used to run the experiment.  In the option
descriptions, lower case letters (f, h and t) mean that the
corresponding options were disabled (no case folding, no header
tagging, no processing of HTML tag contents), while upper case
indicates options that were enabled.  The script first scores the
nonspams to find a spam cutoff value that yields a specified number of
false positives (six across r0.ns and r1.ns combined, reported as three
per run); it then scores the spams with that cutoff and counts the
resulting false negatives.

#!/bin/bash
#  bash is required: the script uses arrays, "function" and "let"
#  runex for the new -P parameters
#  set train to yes to rebuild the training databases
train=yes

function getco () {
  res=`cat $* | ./bogofilter -d $fnam -o 0.1 -Mv | \
    perl -e ' $target = 6; while (<>) { ' \
         -e ' ($i, $d) = split; push @diffs, $d unless $i != 1; }' \
         -e ' die "dainbramage" unless scalar @diffs > 15;' \
         -e ' @s = sort { $a <=> $b } (@diffs); $co = $s[$target];' \
         -e ' while($co < 0.000001) { ++$target; $co = $s[$target]; }' \
         -e ' printf("%8.6f %d",1.0-$s[$target],$target);'`
}

files=(0fht 1fhT 2fHt 3fHT 4Fht 5FhT 6FHt 7FHT)
factors=("f h t" "f h T" "f H t" "f H T" "F h t" "F h T" "F H t" "F H T")
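# Each letter after -P toggles one option away from the default (F H t).
# By that pattern 3fHT would be -Pft; the -Pt listed here matches 7FHT.
opts=(-Pfh -Pfht -Pf -Pt -Ph -Pht "" -Pt)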

echo "fold head html run co fp fn" >Pparms.tbl
for i in `seq 0 7`; do
    fnam=${files[$i]}
    popt=${opts[$i]}
    if [ $train = yes ]; then
	/bin/rm -f $fnam/*
	./bogofilter -d $fnam $popt -s <t.sp
	./bogofilter -d $fnam $popt -n <t.ns
    fi
    getco r0.ns r1.ns
    fp=${res##* }; co=${res%% *}; let fp=$fp/2
    for num in 0 1; do
	./bogofilter -d $fnam $popt -o $co -Mv <r$num.sp \
	    >r$num.sp.$fnam
	fn=`grep -c -v '^1' r$num.sp.$fnam`
	echo "$fnam$num ${factors[$i]} $num $co $fp $fn" >>Pparms.tbl
    done
done
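
The cutoff logic in getco can be restated in a few lines of Python.
This is a sketch of the idea only, not a port of the perl above: the
function names are hypothetical, the scores are made up, and it assumes
that higher scores are more spam-like and that a message is called spam
when its score is strictly above the cutoff.

```python
def cutoff_for_false_positives(nonspam_scores, target_fp):
    """Return the cutoff that exactly target_fp nonspam scores exceed.

    Assumes no tied scores at the cutoff.
    """
    ranked = sorted(nonspam_scores, reverse=True)
    if len(ranked) <= target_fp:
        raise ValueError("need more nonspam scores than target_fp")
    return ranked[target_fp]


def count_false_negatives(spam_scores, cutoff):
    """Count spams that would be passed as nonspam at this cutoff."""
    return sum(1 for score in spam_scores if score <= cutoff)


nonspam = [0.01, 0.20, 0.90, 0.60, 0.95, 0.30]   # made-up scores
spam = [0.99, 0.70, 0.55, 0.60, 1.00]            # made-up scores
cutoff = cutoff_for_false_positives(nonspam, target_fp=2)
print(cutoff, count_false_negatives(spam, cutoff))
```

Fixing the false-positive count in this way lets every option
combination be compared on false negatives alone.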

The data reduction was completed in R:

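# assumed load step: read the table written by the script
parms <- read.table("Pparms.tbl", header=TRUE)
parms$sp <- rep(c(4732,4730),8)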
parms$pc <- 100 * parms$fn / parms$sp
parms
      fold head html run       co fp fn   sp        pc
0fht0    f    h    t   0 0.500686  3 41 4732 0.8664413
0fht1    f    h    t   1 0.500686  3 42 4730 0.8879493
1fhT0    f    h    T   0 0.500866  3 38 4732 0.8030431
1fhT1    f    h    T   1 0.500866  3 41 4730 0.8668076
2fHt0    f    H    t   0 0.500021  3 32 4732 0.6762468
2fHt1    f    H    t   1 0.500021  3 37 4730 0.7822410
3fHT0    f    H    T   0 0.500053  3 32 4732 0.6762468
3fHT1    f    H    T   1 0.500053  3 33 4730 0.6976744
4Fht0    F    h    t   0 0.503836  3 51 4732 1.0777684
4Fht1    F    h    t   1 0.503836  3 52 4730 1.0993658
5FhT0    F    h    T   0 0.505375  3 49 4732 1.0355030
5FhT1    F    h    T   1 0.505375  3 52 4730 1.0993658
6FHt0    F    H    t   0 0.500011  3 31 4732 0.6551141
6FHt1    F    H    t   1 0.500011  3 33 4730 0.6976744
7FHT0    F    H    T   0 0.500053  3 32 4732 0.6762468
7FHT1    F    H    T   1 0.500053  3 33 4730 0.6976744

While not strictly rigorous in this context, an analysis of variance
facilitates the interpretation of the above data.

summary(aov(pc ~ fold + head + html + fold*head + fold*html +
+   head*html + fold*head*html, data=parms))
               Df   Sum Sq  Mean Sq  F value    Pr(>F)    
fold            1 0.038226 0.038226  26.5486 0.0008716 ***
head            1 0.296242 0.296242 205.7430 5.448e-07 ***
html            1 0.002262 0.002262   1.5709 0.2454608    
fold:head       1 0.061685 0.061685  42.8410 0.0001794 ***
fold:html       1 0.001369 0.001369   0.9504 0.3581594    
head:html       1 0.000251 0.000251   0.1743 0.6872818    
fold:head:html  1 0.000251 0.000251   0.1746 0.6870339    
Residuals       8 0.011519 0.001440                       
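
As a cross-check on the Sum Sq column, the main-effect sums of squares
can be recomputed directly from the pc values.  The Python sketch below
is an illustration, not part of the original analysis; the helper names
levels and main_effect_ss are hypothetical, and the formula used is the
standard one for a balanced design (group size times the squared
deviation of each group mean from the grand mean).

```python
# pc values copied from the data table, in row order (0fht0 ... 7FHT1).
PC = [0.8664413, 0.8879493, 0.8030431, 0.8668076,
      0.6762468, 0.7822410, 0.6762468, 0.6976744,
      1.0777684, 1.0993658, 1.0355030, 1.0993658,
      0.6551141, 0.6976744, 0.6762468, 0.6976744]


def levels(row):
    """Map a table row index to (fold, head, html) as 0/1 flags."""
    cond = row // 2                  # two runs per option combination
    return (cond >> 2) & 1, (cond >> 1) & 1, cond & 1


def main_effect_ss(values, factor):
    """Sum of squares for one two-level factor in a balanced design."""
    grand = sum(values) / len(values)
    ss = 0.0
    for level in (0, 1):
        group = [v for i, v in enumerate(values)
                 if levels(i)[factor] == level]
        mean = sum(group) / len(group)
        ss += len(group) * (mean - grand) ** 2
    return ss


for name, factor in (("fold", 0), ("head", 1), ("html", 2)):
    print(name, round(main_effect_ss(PC, factor), 6))
```

The three printed values agree with the fold, head and html rows of the
table to within rounding.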

Folding case significantly increased the false-negative rate; it seems
advisable to run bogofilter with case folding turned off, as Paul
Graham suggested.

Tagging headers significantly decreased the false-negative rate,
especially when case folding was in effect; bogofilter should be run
with header tagging enabled.  (Other experiments by David and me
confirm this finding, but show that even with header tagging enabled,
it's advantageous to disable case folding.)
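
The fold:head interaction is easiest to see in the four cell means.
The Python sketch below recomputes them from the false-negative counts
in the data table; it is an illustration only, and the function name
cell_means is hypothetical.

```python
# False-negative counts copied from the data table, keyed by option
# combination, one count per run (run 0: 4732 spams, run 1: 4730).
FN = {"0fht": (41, 42), "1fhT": (38, 41), "2fHt": (32, 37), "3fHT": (32, 33),
      "4Fht": (51, 52), "5FhT": (49, 52), "6FHt": (31, 33), "7FHT": (32, 33)}
SPAMS = (4732, 4730)


def cell_means(fn, spams):
    """Mean false-negative percentage for each (fold, head) cell."""
    cells = {}
    for name, counts in fn.items():
        key = name[1] + name[2]      # e.g. "fH": folding off, tagging on
        for count, total in zip(counts, spams):
            cells.setdefault(key, []).append(100.0 * count / total)
    return {key: sum(v) / len(v) for key, v in cells.items()}


means = cell_means(FN, SPAMS)
for key in ("fh", "fH", "Fh", "FH"):
    print(key, round(means[key], 3))
```

In these data, header tagging lowers the mean false-negative rate by
about 0.15 percentage points with folding off (fh to fH) and by about
0.40 points with folding on (Fh to FH).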

Processing the contents of A, IMG and FONT HTML tags made little
difference in this experiment; however, other experiments by David and
me gave conflicting results about this effect.  We believe it needs
further study.

-- 
| G r e g  L o u i s          | gpg public key: finger     |
|   http://www.bgl.nu/~glouis |   glouis at consultronics.com |
| http://wecanstopspam.org in signatures fights junk email |