testing of degeneration algorithm
David Relson
relson at osagesoftware.com
Fri Aug 1 01:54:39 CEST 2003
Greetings,
The newly released version of bogofilter, i.e. version 0.14.1, has code to
support Paul Graham's Degeneration algorithm (described in his article
"Better Bayesian Filtering", http://www.paulgraham.com/better.html).
The degeneration code is enabled by the following command line switches:
-Pd - enables token degeneration.
-PD - disables token degeneration (default).
-Pf - enables first degeneration match (default).
-PF - enables extreme score selection.
To learn whether token degeneration is useful or not, I designed the
experiment described below. I then ran the experiment twice using my
incoming mail from June for the first run and the mail for July in the
second. The results are described below.
Experiment:
Divvy each month's spam and ham into 8 mbox files:
t.ns - half of the ham, to be used for training
t.sp - half of the spam, to be used for training
r0.ns - 1/6 of the ham, for test set 1
r0.sp - 1/6 of the spam, for test set 1
r1.ns - 1/6 of the ham, for test set 2
r1.sp - 1/6 of the spam, for test set 2
r2.ns - 1/6 of the ham, for test set 3
r2.sp - 1/6 of the spam, for test set 3
This is the test protocol designed by Greg Louis. A distribution script
runs formail and puts the 1st, 3rd, and 5th messages in the t.?? file, puts
#2 in r0.??, #4 in r1.??, and #6 in r2.?? - and repeats this pattern for
every 6 messages. The half of the messages in t.ns and t.sp are used to
populate the ham and spam databases. Then the messages in r0.ns are all
scored and the second highest (spammish) score is found. This score is the
spam_cutoff value used for scoring the messages in r0.sp. The final
results are the number of spam messages correctly classified and the number
incorrectly classified (a.k.a. false negatives). These steps for r0.ns and
r0.sp are then repeated twice - once for r1.ns and r1.sp and again for
r2.ns and r2.sp. The script displays the counts at each stage of these
tests and also displays the total number of false negatives (for the 3
sub-tests).
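The protocol above can be sketched in a few lines of Python. This is only an illustration, not Greg Louis's actual scripts: the function names (distribute, spam_cutoff, count_spam) are made up for this example, messages and scores are plain Python values rather than mbox files and bogofilter output, and treating a score equal to the cutoff as spam is an assumption.

```python
from typing import Dict, List, Sequence, Tuple

def distribute(messages: Sequence[str]) -> Dict[str, List[str]]:
    """Round-robin split over each group of six messages:
    #1, #3, #5 go to training (t), #2 to r0, #4 to r1, #6 to r2."""
    slots = ["t", "r0", "t", "r1", "t", "r2"]
    out: Dict[str, List[str]] = {"t": [], "r0": [], "r1": [], "r2": []}
    for i, msg in enumerate(messages):
        out[slots[i % 6]].append(msg)
    return out

def spam_cutoff(ham_scores: Sequence[float]) -> float:
    """Second-highest (most spammish) score among the ham test messages."""
    if len(ham_scores) < 2:
        raise ValueError("need at least two ham scores")
    return sorted(ham_scores, reverse=True)[1]

def count_spam(spam_scores: Sequence[float], cutoff: float) -> Tuple[int, int]:
    """Return (correctly classified, false negatives) for the spam test set.
    Assumption: a score at or above the cutoff counts as spam."""
    correct = sum(1 for s in spam_scores if s >= cutoff)
    return correct, len(spam_scores) - correct

if __name__ == "__main__":
    parts = distribute([f"m{i}" for i in range(1, 13)])
    print(parts["t"], parts["r0"], parts["r1"], parts["r2"])
    cutoff = spam_cutoff([0.1, 0.9, 0.5, 0.3])
    print(cutoff, count_spam([0.99, 0.5, 0.2], cutoff))
```

Using the second-highest ham score as the cutoff pins the false positives at roughly the same small number in every run, so the tests can be compared on false negatives alone.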
The first step of the test is to create two versions of wordlist.db - one
using -Pi (case insensitive) and one using -PI (case sensitive).
To establish baseline numbers, I run:
#1. bogofilter -Pi -d test.i.d -- case insensitive for wordlist and for
scoring
#2. bogofilter -PI -d test.I.d -- case sensitive for wordlist and for scoring
Out of curiosity (to see the effect of mixing sensitive/insensitive
environments) I also run:
#3. bogofilter -PI -d test.i.d -- case insensitive for wordlist and case
sensitive for scoring
#4. bogofilter -Pi -d test.I.d -- case sensitive for wordlist and case
insensitive for scoring
Note: As there are more distinct tokens in the -PI wordlist than in the -Pi
wordlist, I expect i.i (#1) to have the poorest accuracy and I.I (#2) to be
best, with i.I (#3) and I.i (#4) somewhere in between.
Bogofilter supports two ways of doing degeneration:
-Pdf turns on degeneration and uses the first matching token to provide the
word counts used for scoring.
-PdF turns on degeneration and uses the most extreme token, i.e. the one
with score furthest from 0.5, for scoring.
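To make the difference between the two modes concrete, here is a small Python sketch. It is not bogofilter's code: the candidate list, the score table and the names pick_first and pick_extreme are all hypothetical, and it works on ready-made per-token scores where bogofilter works on word counts.

```python
from typing import Dict, Optional, Sequence

def pick_first(candidates: Sequence[str],
               scores: Dict[str, float]) -> Optional[str]:
    """First-match mode: return the first candidate found in the wordlist."""
    for tok in candidates:
        if tok in scores:
            return tok
    return None

def pick_extreme(candidates: Sequence[str],
                 scores: Dict[str, float]) -> Optional[str]:
    """Extreme mode: among candidates found in the wordlist, return the one
    whose score is furthest from 0.5."""
    known = [tok for tok in candidates if tok in scores]
    if not known:
        return None
    return max(known, key=lambda tok: abs(scores[tok] - 0.5))

if __name__ == "__main__":
    # Hypothetical degenerate forms of "FREE!!!", most specific first.
    candidates = ["FREE!!!", "Free!!!", "free!!!", "FREE", "free"]
    # Hypothetical wordlist scores; the exact token "FREE!!!" is unknown.
    scores = {"Free!!!": 0.70, "free": 0.95}
    print(pick_first(candidates, scores))    # Free!!!
    print(pick_extreme(candidates, scores))  # free
```

The two modes only differ when more than one degenerate form is in the wordlist, which may be one reason the df and dF rows in the results come out so close.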
The combination of two wordlists and two degeneration options gives 4 more
tests:
#5 i.df - case insensitive/first match
#6 i.dF - case insensitive/most extreme
#7 I.df - case sensitive/first match
#8 I.dF - case sensitive/most extreme
As mentioned, I've used my June messages and my July messages to give two
separate runs of the above experiment (with its 8 tests). Here are the
results:
June messages - 3481 good, 4606 spam
                          S   H       S   H       S   H  Corr  Inc      FN
#1 i.i  sc 0.889032  r0 736  26  r1 738  23  r2 730  32  2204   81   3.54%
#2 i.I  sc 0.889032  r0 740  22  r1 738  23  r2 736  26  2214   71   3.11%
#3 I.i  sc 0.849353  r0 583 179  r1 590 171  r2 577 185  1750  535  23.41%
#4 I.I  sc 0.849353  r0 741  21  r1 741  20  r2 730  32  2212   73   3.19%
#5 i.df sc 0.889032  r0 734  28  r1 734  27  r2 729  33  2197   88   3.85%
#6 i.dF sc 0.889032  r0 734  28  r1 734  27  r2 728  34  2196   89   3.89%
#7 I.df sc 0.849353  r0 738  24  r1 731  30  r2 724  38  2193   92   4.03%
#8 I.dF sc 0.849353  r0 738  24  r1 731  30  r2 724  38  2193   92   4.03%
July messages - 3153 good, 3158 spam
                          S   H       S   H       S   H  Corr  Inc      FN
#1 i.i  sc 0.898621  r0 519   3  r1 525   0  r2 523   0  1567    3   0.19%
#2 i.I  sc 0.898621  r0 517   5  r1 522   3  r2 522   1  1561    9   0.57%
#3 I.i  sc 0.838945  r0 504  18  r1 499  26  r2 499  24  1502   68   4.33%
#4 I.I  sc 0.838945  r0 519   3  r1 525   0  r2 523   0  1567    3   0.19%
#5 i.df sc 0.898621  r0 519   3  r1 525   0  r2 523   0  1567    3   0.19%
#6 i.dF sc 0.898621  r0 519   3  r1 525   0  r2 523   0  1567    3   0.19%
#7 I.df sc 0.838945  r0 519   3  r1 525   0  r2 523   0  1567    3   0.19%
#8 I.dF sc 0.838945  r0 519   3  r1 525   0  r2 523   0  1567    3   0.19%
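For anyone checking the numbers, the FN column appears to be Inc divided by (Corr + Inc), expressed as a percentage. The snippet below recomputes a few rows on that assumption; fn_rate is just a name chosen for this example.

```python
def fn_rate(correct: int, incorrect: int) -> float:
    """False-negative rate in percent: missed spam over all spam tested."""
    return 100.0 * incorrect / (correct + incorrect)

if __name__ == "__main__":
    print(f"{fn_rate(2204, 81):.2f}%")   # June #1: 3.54%
    print(f"{fn_rate(1750, 535):.2f}%")  # June #3: 23.41%
    print(f"{fn_rate(1567, 3):.2f}%")    # July #1: 0.19%
```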
Several observations can be made:
1 - I.I (full case sensitivity) is better than i.i (case insensitive)
2 - I.i is worst.
3 - degeneration is no better than i.i or I.I; mostly it's worse.
4 - the July messages are so distinctly spam or ham that case and
degeneration don't matter.
The test scripts are available to anyone who wants to run the experiment
using their own message collections. Email me and I'll send you a tarball.
David