training on errors only, preliminary

Greg Louis glouis at dynamicro.on.ca
Sat Nov 30 17:12:39 CET 2002


Bill Yerazunis replied to my report comparing Robinson-Fisher with
Bayes chain-rule calculations, and suggested training on errors _only_
(feed training messages, chosen at random, to bogofilter one at a time,
and register them only if bogofilter gets them wrong or returns
unknown).  I have just completed a preliminary rerun of yesterday's
experiment, and it's awesome.

Quick background: I have 6372 spams and 6372 nonspams.  Each group is
split into one file of 2124 messages (for training) and three files of
1416 messages each (for testing).
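
A split like that can be done with formail's +skip and -total options
to its splitting mode.  A sketch, in which the corpus and test-file
names are my own inventions (only t.sp and t.ns are used further
below):

formail +0    -2124 -s cat <all.sp >t.sp       # first 2124 spams: training
formail +2124 -1416 -s cat <all.sp >test0.sp   # next 3 x 1416: testing
formail +3540 -1416 -s cat <all.sp >test1.sp
formail +4956 -1416 -s cat <all.sp >test2.sp
# ... and likewise for the nonspams in all.ns

Yesterday I trained on the entire contents of the training files, and
got the following results: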

The spam cutoff had to be set very low, to values that gave 71 false
positives, in order for a comparison to be possible; even then, the
Bayes chain rule using p(w) as the input datum had to be excluded.  In
the tables below, fneg is the false-negative count out of the 1416
spams in each test run, and percent is the corresponding false-negative
rate:

    calc run fneg percent
1    chi   0   30    2.12
2    chi   1   25    1.77
3    chi   2   27    1.91
4 bcr-fw   0   91    6.43
5 bcr-fw   1   84    5.93
6 bcr-fw   2   96    6.78

In today's experiment, I started over again and trained on errors only,
thus:

Use formail to separate the training spams and nonspams into
individual files named ${RANDOM}s or ${RANDOM}n respectively.  (The
random names mean that a later "for file in *" visits the messages in
an order unrelated to their original sequence, which gives the random
presentation Bill suggested.)

cat randomfn
#! /bin/bash
# Pick an unused random name ending in the suffix passed as $1 (s or n),
# then save the message on stdin under that name in F/.
# ($RANDOM is a bash feature, hence the bash shebang.)
while true; do
  fnam=${RANDOM}$1
  if [ ! -f F/$fnam ]; then break; fi   # test in F/, where the file will go
done
cat >F/$fnam

mkdir F
formail -s randomfn s <t.sp   # one file per training spam
formail -s randomfn n <t.ns   # one file per training nonspam
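
A quick sanity check (not in the original transcript): each class
should now be represented by 2124 files.

ls F | grep -c 's$'   # expect 2124
ls F | grep -c 'n$'   # expect 2124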

Use "for file in *" to train or not, based on error

cat trainerr
#! /bin/bash
# Classify one message file and register it only if the classification
# disagrees with the expected class encoded in the last character of
# the filename (s or n).  An unknown result (any exit code other than
# 0 or 1) never matches the expected letter, so it too gets registered.
file=$1
expect=${file:0-1:1}            # last character of the filename
~/bin/bogobcr -d ~/bcr/db <$file
got=$?
if [ $got -eq 0 ]; then got="s"; elif [ $got -eq 1 ]; then got="n"; fi
if [ $got != $expect ]; then
    echo "registering $file"
    ~/bin/bogobcr -d ~/bcr/db -$expect <$file   # -s or -n as appropriate
fi

cd F
for file in *; do ~/bin/trainerr $file; done
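
An aside, not part of the original run: since trainerr echoes a line
for every registration, the same loop can log its work, and the two
counts should agree with the .MSG_COUNT figures below.

for file in *; do ~/bin/trainerr $file; done >../train.log
grep -c 's$' ../train.log   # messages registered as spam
grep -c 'n$' ../train.log   # messages registered as nonspam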

bogoutil -w ../db .MSG_COUNT
                       spam   good
.MSG_COUNT              466    415

As the .MSG_COUNT values show, most of the training messages were
correctly identified and therefore never registered: only 466 of the
2124 spams and 415 of the 2124 nonspams, roughly a fifth of each, went
into the database.  With this database, I ran the test classifications.
With the default spam cutoff, the Robinson-Fisher method yielded 12, 6
and 6 false positives in the three test runs, an average of 0.56%.  In
yesterday's full-training experiment, the corresponding false-positive
counts were 47, 40 and 39.
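
For concreteness, here is one way such a false-positive count could be
taken over a nonspam test mbox; the file name test0.ns is my invention,
and exit code 0 means "spam", as in trainerr above.

formail -s sh -c '~/bin/bogobcr -d ~/bcr/db >/dev/null; test $? -eq 0 && echo FP' \
    <test0.ns | grep -c FP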

The Bayes chain rule (Bcr) method still needed a lower spam cutoff to
permit comparison; with the cutoff set to give 30 false positives, I
got the following (bcr-pw and bcr-fw take p(w) and f(w) respectively as
the input datum):

    calc run fneg percent
1    chi   0   72    5.08
2    chi   1   54    3.81
3    chi   2   67    4.73
4 bcr-pw   0  197   13.91
5 bcr-pw   1  184   12.99
6 bcr-pw   2  191   13.49
7 bcr-fw   0  132    9.32
8 bcr-fw   1  139    9.82
9 bcr-fw   2  155   10.95

This isn't apples-to-apples with yesterday's table, because here the
spam cutoff gives only 30 false positives rather than 71.  If we go to
yesterday's 71 (5% false positives, far too high for production use),
then the results are not that different from yesterday's:

    calc run fneg percent
1    chi   0   36    2.54
2    chi   1   25    1.77
3    chi   2   33    2.33
4 bcr-pw   0   45    3.18
5 bcr-pw   1   37    2.61
6 bcr-pw   2   37    2.61
7 bcr-fw   0   35    2.47
8 bcr-fw   1   24    1.69
9 bcr-fw   2   30    2.12

The Robinson-Fisher method still has the advantage -- it can achieve
2.1% false positives and 4.5% false negatives in this test -- but it
was also the evaluation algorithm used during the training.  To be fair
(this is why this report is just preliminary), I need to run the test
separately for each algorithm, training with that same algorithm.
Gotta do a bit more coding for that, so it'll be along later; in the
meantime the jury is still out on Robinson-Fisher vs Bcr.  The
important point, for now, is TRAIN ON ERRORS ONLY (and on unknowns, if
operating a ternary classifier).
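
For what it's worth, the rerun I have in mind would look roughly like
the sketch below.  It is hypothetical: it assumes bogobcr grows a -m
option to select the calculation method, and that trainerr passes a
second argument through to it; neither exists yet.

for method in chi bcr-pw bcr-fw; do
    mkdir -p ~/bcr/db-$method                 # one database per method
    ( cd F && for file in *; do ~/bin/trainerr $file $method; done )
done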

-- 
| G r e g  L o u i s          | gpg public key:      |
|   http://www.bgl.nu/~glouis |   finger greg at bgl.nu |


