Test with different lexers
Boris 'pi' Piwinger
3.14 at logic.univie.ac.at
Tue Dec 2 14:19:23 CET 2003
Hi!
I have run another test comparing bogofilter's new lexer with my
version (http://piology.org/bogofilter/lexer_v3.l):
Corpus sizes:
         t    r0    r1    r2    tot
sp   12868  4293  4287  4287  12867
ns   14571  4861  4855  4856  14572
My version:
                     r0   r1   r2  tot
wo (fn): 0.950000   140  141  118  399
wo (fp): 0.950000     1    2    1    4
wi (fn): 0.978575   150  150  124  424
wi (fp): 0.978575     1    2    0    3
wi (fn): 0.972447   145  148  123  416
wi (fp): 0.972447     1    2    1    4
wi (fn): 0.918270   133  132  116  381
wi (fp): 0.918270     2    2    1    5
wi (fn): 0.664719   105  117   99  321
wi (fp): 0.664719     4    3    3   10
Original version:
                     r0   r1   r2  tot
wo (fn): 0.950000   148  149  124  421
wo (fp): 0.950000     1    1    2    4
wi (fn): 0.973601   151  158  133  442
wi (fp): 0.973601     1    1    1    3
wi (fn): 0.967104   150  154  128  432
wi (fp): 0.967104     1    1    2    4
wi (fn): 0.948838   148  148  124  420
wi (fp): 0.948838     1    2    2    5
wi (fn): 0.710234   114  120  107  341
wi (fp): 0.710234     4    4    2   10
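To put the raw counts in proportion, here is a small sketch that turns them into per-run error rates. It assumes (this is not stated in the tables) that r0, r1 and r2 are three test runs, that the sp/ns rows give the number of messages in each run, and that fn is counted against spam and fp against non-spam. The names SPAM, NONSPAM and rates are made up for the illustration.

```python
# Message counts per run (r0, r1, r2), copied from the "Corpus sizes" table.
SPAM = [4293, 4287, 4287]
NONSPAM = [4861, 4855, 4856]


def rates(errors, sizes):
    """Per-run error rates in percent, assuming errors[i] out of sizes[i] messages."""
    return [100.0 * e / n for e, n in zip(errors, sizes)]


# Rows at the fixed 0.950000 cutoff (the "wo" lines), copied from the tables.
mine_fn = rates([140, 141, 118], SPAM)     # my lexer, false negatives
orig_fn = rates([148, 149, 124], SPAM)     # original lexer, false negatives
mine_fp = rates([1, 2, 1], NONSPAM)        # my lexer, false positives
orig_fp = rates([1, 1, 2], NONSPAM)        # original lexer, false positives

for name, row in (("mine fn", mine_fn), ("orig fn", orig_fn),
                  ("mine fp", mine_fp), ("orig fp", orig_fp)):
    print(name, " ".join(f"{r:.3f}%" for r in row))
```

Under those assumptions the false negative rate is a few percent of the spam in each run for both lexers, while false positives stay at a few hundredths of a percent of the non-spam.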
Over time we have introduced several special rules to
deal with specific problematic messages. My version
removes some of those (different token front and back,
dollar rule, no short tokens, no numeric tokens, doctype
switch, maybe more).
With my mail collection those special treatments give no
improvement; on the contrary, the simplified version has an
advantage of roughly 5-10% fewer false negatives. While this is
too small to say it is really better, it is enough to say
that it is at least as good.
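For anyone who wants to check the percentage, here is a short sketch that computes the relative drop in false negatives from the tot columns of the two tables. Pairing the rows by position (both tables list false positive totals of 4, 3, 4, 5 and 10 in that order) is my way of lining them up, not something the test output prescribes; PAIRS and reduction_pct are names invented for the example.

```python
# (fp total, fn total with my lexer, fn total with the original lexer),
# one tuple per pair of rows, in the order the rows appear in the tables.
PAIRS = [
    (4, 399, 421),
    (3, 424, 442),
    (4, 416, 432),
    (5, 381, 420),
    (10, 321, 341),
]


def reduction_pct(mine, orig):
    """Relative reduction in false negatives, in percent of the original count."""
    return 100.0 * (orig - mine) / orig


reductions = [reduction_pct(mine, orig) for _, mine, orig in PAIRS]

for (fp, mine, orig), r in zip(PAIRS, reductions):
    print(f"fp={fp:2d}  fn {orig} -> {mine}  ({r:.1f}% fewer)")
```

With these totals the reductions fall between a bit under 4% and a bit over 9%, depending on the row.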
IIRC it was Tom who argued strongly that we should
really just let the statistics do the work and not intervene with
special rules. If you want to try it, just replace the lexer
file and recompile. It would be great if other people
repeated the test.
pi, who has seen no error in incoming mail for eight days
now:-))