testing of degeneration algorithm

Fri Aug 1 01:54:39 CEST 2003

Greetings,

The newly released version of bogofilter, i.e. version 0.14.1, has code to 
support Paul Graham's degeneration algorithm (described in his article 
"Better Bayesian Filtering", http://www.paulgraham.com/better.html).

The degeneration code is enabled by the following command line switches:

       -Pd   - enables  token degeneration.
       -PD   - disables token degeneration (default).
       -Pf   - enables  first degeneration match (default).
       -PF   - enables  extreme score selection.

To learn whether token degeneration is useful or not, I designed the 
experiment described below.  I then ran the experiment twice using my 
incoming mail from June for the first run and the mail for July in the 
second.  The results are described below.

Experiment:

Divvy each month's spam and ham into 8 mbox files:

     t.ns - half of the ham,  to be used for training
     t.sp - half of the spam, to be used for training
     r0.ns - 1/6 of the ham,  for test set 1
     r0.sp - 1/6 of the spam, for test set 1
     r1.ns - 1/6 of the ham,  for test set 2
     r1.sp - 1/6 of the spam, for test set 2
     r2.ns - 1/6 of the ham,  for test set 3
     r2.sp - 1/6 of the spam, for test set 3

This is the test protocol designed by Greg Louis.  A distribution script 
runs formail and puts the 1st, 3rd, and 5th messages in the t.?? file, puts 
#2 in r0.??, #4 in r1.??, and #6 in r2.?? - and repeats this pattern for 
every 6 messages.  The half of the messages in t.ns and t.sp are used to 
populate the ham and spam databases.  Then the messages in r0.ns are all 
scored and the second highest (spammish) score is found.  This score is the 
spam_cutoff value used for scoring the messages in r0.sp.  The final 
results are the number of spam messages correctly classified and the number 
incorrectly classified (a.k.a. false negatives).  These steps for r0.ns and 
r0.sp are then repeated twice - once for r1.ns and r1.sp and again for 
r2.ns and r2.sp.  The script displays the counts at each stage of these 
tests and also displays the total number of false negatives (for the 3 
sub-tests).
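
The protocol above can be sketched in a few lines of Python.  This is only 
an illustration, not the actual test scripts: the function names are made 
up, messages and scores are plain Python values rather than mbox files, and 
it assumes a message counts as spam when its score is greater than or equal 
to the cutoff.

```python
def distribute(messages):
    """Split messages into t, r0, r1 and r2 in a repeating pattern of six."""
    sets = {"t": [], "r0": [], "r1": [], "r2": []}
    # Messages 1, 3 and 5 of each group of six go to training;
    # messages 2, 4 and 6 go to r0, r1 and r2 respectively.
    pattern = ["t", "r0", "t", "r1", "t", "r2"]
    for index, message in enumerate(messages):
        sets[pattern[index % 6]].append(message)
    return sets


def evaluate(ham_scores, spam_scores):
    """Derive the cutoff from the ham scores, then count spam hits and misses."""
    # The second highest (most spammish) ham score becomes spam_cutoff.
    cutoff = sorted(ham_scores, reverse=True)[1]
    # Assumption: score >= cutoff means "classified as spam".
    correct = sum(1 for score in spam_scores if score >= cutoff)
    false_negatives = len(spam_scores) - correct
    return cutoff, correct, false_negatives
```

For example, distribute(range(1, 13)) puts 1, 3, 5, 7, 9, 11 in t, 2 and 8 
in r0, 4 and 10 in r1, and 6 and 12 in r2.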

The first step of the test is to create two versions of wordlist.db - one 
using -Pi (case insensitive) and one using -PI (case sensitive).

To establish baseline numbers, I run:

#1.  bogofilter -Pi -d test.i.d -- case insensitive for wordlist and for 
scoring
#2.  bogofilter -PI -d test.I.d -- case sensitive for wordlist and for scoring

Out of curiosity (to see the effect of mixing sensitive/insensitive 
environments) I also run:

#3.  bogofilter -PI -d test.i.d -- case insensitive for wordlist and case 
sensitive for scoring
#4.  bogofilter -Pi -d test.I.d -- case sensitive for wordlist and case 
insensitive for scoring

Note: As there are more distinct tokens in the -PI wordlist than in the -Pi 
wordlist, I expect i.i (#1) to have the poorest accuracy and I.I (#2) to be 
best, with i.I (#3) and I.i (#4) somewhere in between.

Bogofilter supports two ways of choosing among degenerated tokens:

-Pdf turns on degeneration and uses the first matching token to provide the 
word counts used for scoring.
-PdF turns on degeneration and uses the most extreme token, i.e. the one 
with score furthest from 0.5, for scoring.
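
The difference between the two selection rules can be sketched in Python.  
This is not bogofilter's code: the function name is hypothetical, the 
wordlist is simplified to a dict of token -> spam score (bogofilter works 
from word counts), the list of degenerated variants is supplied by the 
caller, and it assumes degeneration is only tried when the exact token is 
not in the wordlist.

```python
def degenerate_score(token, variants, wordlist, extreme=False):
    """Return a score for token, falling back to its degenerated variants.

    variants: degenerated forms of token, most specific first.
    wordlist: dict mapping token -> spam score in [0, 1].
    extreme:  False mimics -Pf (first match), True mimics -PF (most extreme).
    """
    # Assumption: an exact match is always used as-is.
    if token in wordlist:
        return wordlist[token]
    matches = [wordlist[v] for v in variants if v in wordlist]
    if not matches:
        return None  # unknown token, nothing to degenerate to
    if extreme:
        # Pick the score furthest from the neutral value 0.5.
        return max(matches, key=lambda score: abs(score - 0.5))
    # Otherwise the first (most specific) matching variant wins.
    return matches[0]
```

With variants ["Free!!!", "free!!!", "FREE!", "Free!", "free!"] for the 
token "FREE!!!" and a wordlist containing only "Free!!!" (0.6) and "free!" 
(0.98), first match returns 0.6 and extreme selection returns 0.98.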

The combination of two wordlists and two degeneration options gives 4 more 
tests:

#5 i.df - case insensitive/first match
#6 i.dF - case insensitive/most extreme
#7 I.df - case sensitive/first match
#8 I.dF - case sensitive/most extreme

As mentioned, I've used my June messages and my July messages to give two 
separate runs of the above experiment (with its 8 tests).  Here are the 
results:

June messages - 3481 good, 4606 spam
                          S    H      S    H      S    H  Corr Inc   FN
#1 i.i  sc 0.889032  r0 736  26  r1 738  23  r2 730  32  2204  81  3.54%
#2 i.I  sc 0.889032  r0 740  22  r1 738  23  r2 736  26  2214  71  3.11%
#3 I.i  sc 0.849353  r0 583 179  r1 590 171  r2 577 185  1750 535 23.41%
#4 I.I  sc 0.849353  r0 741  21  r1 741  20  r2 730  32  2212  73  3.19%
#5 i.df sc 0.889032  r0 734  28  r1 734  27  r2 729  33  2197  88  3.85%
#6 i.dF sc 0.889032  r0 734  28  r1 734  27  r2 728  34  2196  89  3.89%
#7 I.df sc 0.849353  r0 738  24  r1 731  30  r2 724  38  2193  92  4.03%
#8 I.dF sc 0.849353  r0 738  24  r1 731  30  r2 724  38  2193  92  4.03%

July messages - 3153 good, 3158 spam
                          S    H      S    H      S    H  Corr Inc   FN
#1 i.i  sc 0.898621  r0 519   3  r1 525   0  r2 523   0  1567   3  0.19%
#2 i.I  sc 0.898621  r0 517   5  r1 522   3  r2 522   1  1561   9  0.57%
#3 I.i  sc 0.838945  r0 504  18  r1 499  26  r2 499  24  1502  68  4.33%
#4 I.I  sc 0.838945  r0 519   3  r1 525   0  r2 523   0  1567   3  0.19%
#5 i.df sc 0.898621  r0 519   3  r1 525   0  r2 523   0  1567   3  0.19%
#6 i.dF sc 0.898621  r0 519   3  r1 525   0  r2 523   0  1567   3  0.19%
#7 I.df sc 0.838945  r0 519   3  r1 525   0  r2 523   0  1567   3  0.19%
#8 I.dF sc 0.838945  r0 519   3  r1 525   0  r2 523   0  1567   3  0.19%
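
In both tables the FN column is consistent with Inc / (Corr + Inc), where 
Corr and Inc are the spam counts summed over the three sub-tests.  A small 
Python check (the function name is just for illustration):

```python
def fn_percent(correct, incorrect):
    """False negative rate as a percentage of all spam messages tested."""
    return round(100.0 * incorrect / (correct + incorrect), 2)


# June, test #1: 2204 correct, 81 incorrect
print(fn_percent(2204, 81))   # 3.54
# July, test #3: 1502 correct, 68 incorrect
print(fn_percent(1502, 68))   # 4.33
```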

Several observations can be made:

1 - I.I (full case sensitivity) is better than i.i (case insensitive)
2 - I.i is worst.
3 - degeneration is no better than i.i or I.I; mostly it's worse.
4 - the July messages are so distinctly spam or ham that case and 
degeneration don't matter.

The test scripts are available to anyone who wants to run the experiment 
using their own message collections.  Email me and I'll send you a tarball.

David