Test with different lexers

Tue Dec 2 14:19:23 CET 2003

Hi!

I have done another test with bogofilter's new lexer and my
version (http://piology.org/bogofilter/lexer_v3.l):

Corpus sizes:
            t     r0     r1     r2    tot
sp      12868   4293   4287   4287  12867
ns      14571   4861   4855   4856  14572

My version:
wo (fn):  0.950000   140    141    118    399
wo (fp):  0.950000     1      2      1      4
wi (fn):  0.978575   150    150    124    424
wi (fp):  0.978575     1      2      0      3
wi (fn):  0.972447   145    148    123    416
wi (fp):  0.972447     1      2      1      4
wi (fn):  0.918270   133    132    116    381
wi (fp):  0.918270     2      2      1      5
wi (fn):  0.664719   105    117     99    321
wi (fp):  0.664719     4      3      3     10

Original version:
wo (fn):  0.950000   148    149    124    421
wo (fp):  0.950000     1      1      2      4
wi (fn):  0.973601   151    158    133    442
wi (fp):  0.973601     1      1      1      3
wi (fn):  0.967104   150    154    128    432
wi (fp):  0.967104     1      1      2      4
wi (fn):  0.948838   148    148    124    420
wi (fp):  0.948838     1      2      2      5
wi (fn):  0.710234   114    120    107    341
wi (fp):  0.710234     4      4      2     10

Over time we have introduced several special rules to deal
with specific problematic messages. My version removes some
of those (different token front and back, dollar rule, no
short tokens, no numeric tokens, doctype switch, maybe more).

With my mail collection those special treatments don't give
any improvement. On the contrary, the simplified version
comes out ahead with roughly 5-10% fewer false negatives at
the same number of false positives. While this is too small
a difference to really say it is better, it is good enough
to say that it is at least as good.
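
For anyone who wants to redo the arithmetic, here is a small
Python sketch. It only uses the "tot" column of the two tables
above. Pairing the rows in the order they are listed is my
assumption; it happens to pair rows with equal false positive
totals. The names ROWS and fn_reduction_percent are made up
for this example.

```python
# False-negative totals ("tot" column) copied from the two tables above.
# Assumption: rows are paired in listed order, which gives each pair
# the same false-positive total.
# Each tuple: (fp total, fn with simplified lexer, fn with original lexer)
ROWS = [
    (4, 399, 421),
    (3, 424, 442),
    (4, 416, 432),
    (5, 381, 420),
    (10, 321, 341),
]


def fn_reduction_percent(simplified, original):
    """Relative drop in false negatives, in percent of the original count."""
    return 100.0 * (original - simplified) / original


reductions = [fn_reduction_percent(s, o) for _, s, o in ROWS]

for (fp, s, o), r in zip(ROWS, reductions):
    print(f"fp={fp:2d}  fn {o} -> {s}  ({r:.1f}% fewer)")
```

With these totals the reduction comes out between about 4%
and 9% depending on the row, which is in line with the rough
figure above.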

IIRC it was Tom who argued strongly that we should really
just let the statistics do the work and not intervene with
special rules. If you want to try it, just replace the lexer
file and recompile. It would be great if other people would
repeat the test.

pi, who has seen no errors in incoming mail for eight days
now :-))