training on errors only, preliminary
Greg Louis
glouis at dynamicro.on.ca
Sat Nov 30 17:12:39 CET 2002
Bill Yerazunis replied to my report comparing Robinson-Fisher with
Bayes chain-rule calculations, and suggested training on errors _only_
(feed training messages chosen at random to bogofilter, one at a time,
and register a message only if bogofilter classifies it incorrectly or
as unknown). I have
just completed a preliminary rerun of yesterday's experiment, and it's
awesome.
Quick background: I have 6372 spam and 6372 nonspam. Each group is
split into one file of 2124 messages (for training) and three files of
1416 messages each (for testing). Yesterday I trained on the entire
contents of the training files, and got the following results:
To make a comparison possible, the spam cutoff had to be set very low,
to values that gave 71 false positives; even then, the Bayes chain rule
using p(w) as the input datum had to be excluded. I got:
    calc    run  fneg  percent
1   chi       0    30     2.12
2   chi       1    25     1.77
3   chi       2    27     1.91
4   bcr-fw    0    91     6.43
5   bcr-fw    1    84     5.93
6   bcr-fw    2    96     6.78
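Picking such a cutoff is mechanical, by the way. Here is a sketch of
the idea (not something I actually ran; it assumes the spamicity scores
of the 1416 nonspam test messages have been collected, one per line,
into a file I'll call ns.scores):
# ns.scores: spamicity of each nonspam test message, one per line
# (hypothetical file, not produced by the scripts in this message).
# Sorted highest first, line N is the Nth-worst nonspam score; any
# cutoff strictly between lines 71 and 72 admits exactly 71 false
# positives (ties aside).
sort -rn ns.scores | sed -n '71p;72p'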
In today's experiment, I started over again and trained on errors only,
thus:
Use formail to separate the training spams and nonspams into individual
files named ${RANDOM}s or ${RANDOM}n respectively.
cat randomfn
#! /bin/sh
# Write the message on stdin to F/ under a random name that is not
# already in use; ${RANDOM} has only 32768 values, so collisions are
# possible.  $1 is the suffix: "s" for spam, "n" for nonspam.
while true; do
    fnam=${RANDOM}$1
    if [ ! -f F/$fnam ]; then break; fi
done
cat >F/$fnam

mkdir F
formail -s randomfn s <t.sp
formail -s randomfn n <t.ns
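A quick sanity check, not part of the sequence above, is to count the
files; the totals should match the training-set sizes:
ls F | wc -l           # expect 4248 = 2124 spams + 2124 nonspams
ls F | grep -c 's$'    # expect 2124 spam files
ls F | grep -c 'n$'    # expect 2124 nonspam files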
Use "for file in *" to train or not, based on error
cat trainerr
#! /bin/sh
# Classify one message and register it only if the classification is
# wrong or inconclusive.  The last character of the filename ("s" or
# "n") records what the message really is.
file=$1
expect=${file:0-1:1}
~/bin/bogobcr -d ~/bcr/db <$file
got=$?
# Exit status 0 means spam, 1 means nonspam; any other status stays
# numeric, so it can never equal $expect and the message is registered.
if [ $got -eq 0 ]; then got="s"; elif [ $got -eq 1 ]; then got="n"; fi
if [ $got != $expect ]; then
    echo "registering $file"
    ~/bin/bogobcr -d ~/bcr/db -$expect <$file
fi

Then, in F/:
for file in *; do ~/bin/trainerr $file; done
bogoutil -w ../db .MSG_COUNT
             spam  good
.MSG_COUNT    466   415
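To put those counts in perspective: only 466 + 415 = 881 of the
2 x 2124 = 4248 training messages, about 21%, ended up being
registered.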
As evidenced by the .MSG_COUNT values, most of the training messages
were correctly identified and therefore not used in training. With
this database, I ran the test classifications. With the default spam
cutoff, the Robinson-Fisher method yielded 12, 6 and 6 false positives,
an average of 0.55%. In yesterday's full-training experiment, the
corresponding false-positive counts were 47, 40 and 39.
The Bayes chain rule (Bcr) method still needed a lower spam cutoff to
permit comparison; with the cutoff set to give 30 false positives, I
got:
    calc    run  fneg  percent
1   chi       0    72     5.08
2   chi       1    54     3.81
3   chi       2    67     4.73
4   bcr-pw    0   197    13.91
5   bcr-pw    1   184    12.99
6   bcr-pw    2   191    13.49
7   bcr-fw    0   132     9.32
8   bcr-fw    1   139     9.82
9   bcr-fw    2   155    10.95
This isn't apples-to-apples, because we're operating with a spam cutoff
that gives only 30 false positives. If we instead allow yesterday's 71
false positives (71 of 1416, i.e. 5%, far too high for production use),
the results are not that different from yesterday's:
    calc    run  fneg  percent
1   chi       0    36     2.54
2   chi       1    25     1.77
3   chi       2    33     2.33
4   bcr-pw    0    45     3.18
5   bcr-pw    1    37     2.61
6   bcr-pw    2    37     2.61
7   bcr-fw    0    35     2.47
8   bcr-fw    1    24     1.69
9   bcr-fw    2    30     2.12
The Robinson-Fisher method still has the advantage -- it can achieve
2.1% false positives (30 of 1416) and 4.5% false negatives in this
test -- but it was also the evaluation algorithm used in the training.
To be fair (this is why this report is just preliminary), I need to run
the test separately for each algorithm, training with that algorithm.
Gotta do a bit more coding for that, so it'll be along later; in the
meantime the jury is still out on Robinson-Fisher vs Bcr. The important
point, for now, is TRAIN ON ERRORS ONLY (and on unknowns, if operating
a ternary classifier).
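The extra coding amounts to little more than parameterizing trainerr by
the classification command. A minimal sketch of one way to do it (the
script name trainerr2 and its calling convention are invented here for
illustration; only the bogobcr invocation shown is from the runs
above):
cat trainerr2
#! /bin/sh
# $1 = classification/registration command, $2 = message file whose
# last character ("s" or "n") records what it really is.
cmd=$1
file=$2
expect=${file:0-1:1}
$cmd <$file
got=$?
if [ $got -eq 0 ]; then got="s"; elif [ $got -eq 1 ]; then got="n"; fi
if [ $got != $expect ]; then
    echo "registering $file"
    $cmd -$expect <$file
fi
Then, in F/, something like:
for file in *; do ~/bin/trainerr2 "$HOME/bin/bogobcr -d $HOME/bcr/db" $file; done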
--
| G r e g L o u i s | gpg public key: |
| http://www.bgl.nu/~glouis | finger greg at bgl.nu |