Testing training methods

Tue Nov 18 13:11:44 CET 2003

Hi!

In the past I have done several tests of training methods:
http://article.gmane.org/gmane.mail.bogofilter.general/4373
http://article.gmane.org/gmane.mail.bogofilter.general/5346
http://article.gmane.org/gmane.mail.bogofilter.general/5403

Here is another set of tests:

The first is with my new version of the lexer, which allows
tokens of length one and two, numbers, and slightly changed
characters at the front and back of a token. All tests use
the default parameters (-C).

sizes of mboxes:
            t     r0     r1     r2    tot
sp      12085   4031   4027   4026  12084
ns      13396   4469   4464   4464  13397
ns: 13397, sp: 12084, target: 34
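
A note on reading the tables: the "wi" rows always total 34 false
positives, which matches "target: 34". One plausible reading is that
the "wi" cutoff is chosen so that the number of false positives hits
that target, and the false negatives are then counted at the same
cutoff. Below is a minimal Python sketch of that idea. It is not the
script used for these tests; the function names, the rule "spam if
score >= cutoff" and the toy scores are assumptions for illustration.

```python
def cutoff_for_fp_target(ham_scores, target_fp):
    """Return a cutoff at which `target_fp` non-spam scores are >= cutoff.

    Assumes a message is called spam when score >= cutoff. If several
    non-spam messages share the returned score, the real false positive
    count can end up above the target.
    """
    if not 0 < target_fp <= len(ham_scores):
        raise ValueError("target_fp must be between 1 and the number of scores")
    ranked = sorted(ham_scores, reverse=True)
    return ranked[target_fp - 1]


def count_errors(ham_scores, spam_scores, cutoff):
    """Return (false positives, false negatives) at the given cutoff."""
    fp = sum(1 for s in ham_scores if s >= cutoff)
    fn = sum(1 for s in spam_scores if s < cutoff)
    return fp, fn


# Toy scores, invented for this example only.
ham = [0.1, 0.2, 0.6, 0.97, 0.99]
spam = [0.5, 0.98, 1.0]

cutoff = cutoff_for_fp_target(ham, target_fp=2)
fp, fn = count_errors(ham, spam, cutoff)
print(cutoff, fp, fn)
```

Fixing the false positive count this way makes the false negative
counts of different training methods directly comparable.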

test: N (full training)
    wordlist  ns 13396, sp 12085
wo (fn):  0.950000   130    137    116    383
wo (fp):  0.950000     2      2      1      5
wi (fn):  0.498987    44     45     36    125
wi (fp):  0.498987    12     10     12     34

test: R (randomtrain)
    wordlist  ns 47, sp 401
wo (fn):  0.950000    69     67     54    190
wo (fp):  0.950000     8      8      4     20
wi (fn):  0.908152    45     43     33    121
wi (fp):  0.908152    14     11      9     34

test: M (one run of bogominitrain.pl)
    wordlist  ns 43, sp 252
wo (fn):  0.950000    85     92     67    244
wo (fp):  0.950000    10     28     18     56
wi (fn):  0.987733   194    182    162    538
wi (fp):  0.987733     4     19     11     34

test: Mf (bogominitrain.pl -fn)
    wordlist  ns 62, sp 495
wo (fn):  0.950000    61     60     58    179
wo (fp):  0.950000     2      3      2      7
wi (fn):  0.856059    24     29     27     80
wi (fp):  0.856059    10     14     10     34

sizes of the database:
 27M test.N.d/wordlist.db
1.7M test.R.d/wordlist.db
1.1M test.M.d/wordlist.db
1.7M test.Mf.d/wordlist.db

Note that no security margin was used for the three
train-on-error methods, so those results are not as good as
you would see in normal production.

We clearly see:

- Neither one run of randomtrain nor one run of bogominitrain.pl
  produces good results. Both have a high risk of false
  positives and leave many false negatives. I cannot explain
  why the two produce such different results here; they
  should be similar.

- Training to exhaustion (test Mf) was again the best method
  in the test, even without a security margin.
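
For readers who have not seen it, training to exhaustion can be
sketched roughly as follows: make train-on-error passes over the
messages and stop when a complete pass trains nothing. This is a
minimal Python sketch, not the code of bogominitrain.pl. ToyFilter is
a hypothetical stand-in for the real filter, and the `margin`
parameter reflects my assumption about what a security margin does
(also train on messages that are classified correctly but score too
close to the cutoff).

```python
from collections import Counter


class ToyFilter:
    """Hypothetical stand-in for a real filter: per-class token counts."""

    def __init__(self):
        self.counts = {True: Counter(), False: Counter()}

    def train(self, tokens, is_spam):
        self.counts[is_spam].update(tokens)

    def score(self, tokens):
        spam = sum(self.counts[True][t] for t in tokens)
        ham = sum(self.counts[False][t] for t in tokens)
        total = spam + ham
        return 0.5 if total == 0 else spam / total


def classified_ok(score, is_spam, cutoff, margin):
    """Spam must score >= cutoff + margin, non-spam < cutoff - margin."""
    return score >= cutoff + margin if is_spam else score < cutoff - margin


def train_to_exhaustion(filt, messages, cutoff=0.5, margin=0.0, max_passes=50):
    """Repeat train-on-error passes until a full pass trains nothing.

    Returns the number of passes made, or None if max_passes was reached
    without a clean pass.
    """
    for passes in range(1, max_passes + 1):
        trained = 0
        for tokens, is_spam in messages:
            if not classified_ok(filt.score(tokens), is_spam, cutoff, margin):
                filt.train(tokens, is_spam)
                trained += 1
        if trained == 0:
            return passes
    return None


# Toy messages, invented for this example only.
messages = [
    (["cheap", "pills", "now"], True),
    (["meeting", "notes", "now"], False),
    (["cheap", "flights", "meeting"], False),
    (["pills", "offer"], True),
]

filt = ToyFilter()
passes = train_to_exhaustion(filt, messages)
print("passes needed:", passes)
```

Only messages that are misclassified at the moment they are seen get
trained, which is why the resulting wordlists stay small.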

The second run is as above, but with the lexer we now have
in CVS (including the removal of ' and ` at the end of a token).

test: N
    wordlist  ns 13396, sp 12085
wo (fn):  0.950000   140    143    122    405
wo (fp):  0.950000     1      3      2      6
wi (fn):  0.498735    43     46     35    124
wi (fp):  0.498735    14      7     13     34

test: R
    wordlist  ns 49, sp 413
wo (fn):  0.950000    89     96     88    273
wo (fp):  0.950000    11     11      6     28
wi (fn):  0.936851    76     76     72    224
wi (fp):  0.936851    12     12     10     34

test: M
    wordlist  ns 44, sp 340
wo (fn):  0.950000   170    150    161    481
wo (fp):  0.950000     8      8      7     23
wi (fn):  0.926234   137    119    124    380
wi (fp):  0.926234    10     12     12     34

test: Mf
    wordlist  ns 56, sp 611
wo (fn):  0.950000    86     90     63    239
wo (fp):  0.950000     4      3      4     11
wi (fn):  0.844118    41     35     33    109
wi (fp):  0.844118    12     11     11     34

 25M test.N.d/wordlist.db
1.5M test.R.d/wordlist.db
1.4M test.M.d/wordlist.db
2.0M test.Mf.d/wordlist.db

The results are similar to the first run, so it is
interesting to compare the two. Tests N and Mf give
better results with the new lexer. For R the results look
totally different. M doesn't answer the question either.
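
To make the comparison easier, the "wi (fn)" totals of both runs can
be turned into false negative rates. The short Python sketch below
only uses numbers from the tables above; dividing by the "tot" value
of the sp row (12084) assumes that the false negatives were counted
over the r0, r1 and r2 parts together.

```python
SPAM_TOTAL = 12084  # "tot" column of the sp row (assumed denominator)

# Total false negatives at the 34 false positive target ("wi (fn)" rows).
fn_first_run = {"N": 125, "R": 121, "M": 538, "Mf": 80}
fn_second_run = {"N": 124, "R": 224, "M": 380, "Mf": 109}


def fn_rate(fn_total, spam_total=SPAM_TOTAL):
    """Fraction of spam messages that were missed."""
    return fn_total / spam_total


print("test   first  second    diff")
for test in fn_first_run:
    first = fn_rate(fn_first_run[test])
    second = fn_rate(fn_second_run[test])
    print(f"{test:>4}  {first:6.2%}  {second:6.2%}  {second - first:+7.2%}")
```

A lower rate is better, so a positive value in the last column means
the first run did better for that test.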

pi