lexer change
Boris 'pi' Piwinger
3.14 at logic.univie.ac.at
Mon Nov 10 17:17:41 CET 2003
Hi!
Here is one more test about the two-byte-token/numerical patch.
Method 1 is with this patch, method 2 is 0.15.8, method 3 is
only two-byte-token, method 4 is only numeric.
Corpus sizes:
        t    r0    r1    r2    tot
sp  11679  3896  3892  3891  11679
ns  12722  4244  4239  4240  12723
Method 1 training: 2.9M
             spam  good
.MSG_COUNT    586   318
wo (fn): 0.500000   22  18  24   64
wo (fp): 0.500000    5   4   6   15
wi (fn): 0.612968   53  35  41  129
wi (fp): 0.612968    2   2   2    6
wi (fn): 0.500248   22  18  24   64
wi (fp): 0.500248    5   4   5   14
wi (fn): 0.471041   14  15  12   41
wi (fp): 0.471041   13  10   9   32
Method 2 training: 2.7M
             spam  good
.MSG_COUNT    592   307
wo (fn): 0.500000   26  23  19   68
wo (fp): 0.500000    5   4   4   13
wi (fn): 0.581092   50  41  41  132
wi (fp): 0.581092    3   2   1    6
wi (fn): 0.499993   26  23  19   68
wi (fp): 0.499993    6   4   5   15
wi (fn): 0.457261   15  15  14   44
wi (fp): 0.457261   14  10   8   32
Method 3 training: 2.8M
             spam  good
.MSG_COUNT    630   284
wo (fn): 0.500000   24  22  22   68
wo (fp): 0.500000    4   4   3   11
wi (fn): 0.544564   40  30  31  101
wi (fp): 0.544564    3   1   2    6
wi (fn): 0.499999   24  22  21   67
wi (fp): 0.499999    5   4   4   13
wi (fn): 0.419627    8  12  15   35
wi (fp): 0.419627   12   8  11   31
Method 4 training: 2.8M
             spam  good
.MSG_COUNT    584   312
wo (fn): 0.500000   26  24  24   74
wo (fp): 0.500000    4   5   4   13
wi (fn): 0.629161   57  49  42  148
wi (fp): 0.629161    3   1   2    6
wi (fn): 0.499967   25  24  24   73
wi (fp): 0.499967    4   6   4   14
wi (fn): 0.450762   13  19  12   44
wi (fp): 0.450762   12  11   9   32
So we see that method 3 (allowing two-byte tokens) is the
most useful, mainly because it helps identify ham better.
Looking at number tokens does not change much; it even
performs a bit worse.
I made another interesting observation. The lexer takes
care of dollar prices of the form "$[0-9]+(\.[0-9]+)?".
Well, OK, but then why not "(€|EUR) [0-9]+([.,][0-9]+)?"
as well? Since there are many currencies out there, that
would not make much sense. So my question is rather whether
we need the dollar case at all (which, at the same time,
does not allow 123.45$ or $100,000). Or, if we need it, why
only this special case of a price?
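To make the point about the narrow dollar rule concrete, here
is a small Python sketch. It is not the lexer code (that is a
flex grammar); it only mirrors the two patterns as written
above, with the "$" escaped for Python's re module. The helper
name is_price and the sample strings are made up for
illustration.

```python
import re

# Patterns mirrored from the post. The leading "$" is escaped because
# Python's re module treats a bare "$" as an end-of-string anchor.
DOLLAR = re.compile(r"\$[0-9]+(\.[0-9]+)?")
EURO = re.compile(r"(€|EUR) [0-9]+([.,][0-9]+)?")


def is_price(pattern, text):
    """Return True if the whole string is matched by the pattern.

    Hypothetical helper for this illustration only.
    """
    return pattern.fullmatch(text) is not None


samples = ["$100", "$19.95", "123.45$", "$100,000", "€ 19,95", "EUR 100"]
for s in samples:
    print(f"{s!r:12} dollar={is_price(DOLLAR, s)} euro={is_price(EURO, s)}")
```

The trailing-dollar and thousands-separator forms fall through
the dollar pattern as whole tokens, which is exactly the
inconsistency questioned above.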
So while we are at it let's also test this (removal of that
rule):
Method 5 training: 2.7M
spam good
.MSG_COUNT    593   309
wo (fn): 0.500000   30  30  20   80
wo (fp): 0.500000    4   5   3   12
wi (fn): 0.546680   41  36  34  111
wi (fp): 0.546680    2   2   2    6
wi (fn): 0.499780   29  30  19   78
wi (fp): 0.499780    5   6   4   15
wi (fn): 0.457308   16  18  13   47
wi (fp): 0.457308   14  10   8   32
So not looking at those prices is actually a good way to
avoid false positives at a very small price. In fact, this
performs better than the unmodified release (method 2) for
small numbers of fp! In other words: the $-rule is as good
or bad as the other rules above.
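For a quick side-by-side view, the following Python sketch
puts the totals from the tables above next to the unmodified
release (method 2). The numbers are copied from the "tot"
column; the row label "fp6" (the run that ended with 6 false
positives) and the function name are my own, chosen only for
this illustration.

```python
# (fn_total, fp_total) per method, copied from the tables in this post.
# "wo" is the run without an fp target; "fp6" is a label invented here
# for the run that ended with 6 false positives.
RESULTS = {
    1: {"wo": (64, 15), "fp6": (129, 6)},
    2: {"wo": (68, 13), "fp6": (132, 6)},
    3: {"wo": (68, 11), "fp6": (101, 6)},
    4: {"wo": (74, 13), "fp6": (148, 6)},
    5: {"wo": (80, 12), "fp6": (111, 6)},
}
BASELINE = 2  # method 2 is the unmodified 0.15.8 release


def delta_vs_baseline(method, row):
    """Return (fn difference, fp difference) relative to the baseline."""
    fn, fp = RESULTS[method][row]
    base_fn, base_fp = RESULTS[BASELINE][row]
    return fn - base_fn, fp - base_fp


for m in sorted(RESULTS):
    for row in ("wo", "fp6"):
        d_fn, d_fp = delta_vs_baseline(m, row)
        print(f"method {m} {row:3}: fn {d_fn:+4d}  fp {d_fp:+3d}")
```

Negative numbers mean fewer errors than the release. At the
6-fp level, methods 3 and 5 both need clearly fewer false
negatives than method 2.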
General remarks:
As you can see, I tried different false-positive targets
(none, 6, 14, 32). Note that they are often missed (as
expected). So you always have to look at the actual false
positives, which is also a very important plausibility check.
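Why a target can be missed is easy to see in a toy example.
The Python sketch below assumes that the cutoff is chosen on
one set of ham scores and then applied to a different set; how
the test scripts really pick the cutoff is not described in
this mail, so take this only as an illustration. All scores
and function names are made up.

```python
def count_fp(ham_scores, cutoff):
    """Number of ham messages that would be classified as spam."""
    return sum(1 for s in ham_scores if s >= cutoff)


def cutoff_for_fp_target(ham_scores, target_fp):
    """Lowest observed score usable as cutoff without exceeding target_fp.

    Assumption for this sketch: a message counts as spam when its score
    is >= cutoff. Returns infinity if no observed score meets the target.
    """
    for candidate in sorted(set(ham_scores)):
        if count_fp(ham_scores, candidate) <= target_fp:
            return candidate
    return float("inf")


# Made-up ham scores: one set to choose the cutoff, another to test it.
calibration = [0.10, 0.20, 0.45, 0.48, 0.52, 0.61, 0.70]
test = [0.05, 0.30, 0.50, 0.62, 0.65, 0.80]

cutoff = cutoff_for_fp_target(calibration, target_fp=2)
print("cutoff:", cutoff)
print("fp on calibration set:", count_fp(calibration, cutoff))
print("fp on test set:", count_fp(test, cutoff))
```

The cutoff hits the target on the set it was derived from but
not necessarily on the other one, so the fp rows in the tables
always need to be read together with the fn rows.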
Further note that those targets often lead to results nobody
would use: sometimes you increase the false positives by a
factor of two or three and gain only a few fewer false
negatives.
But as you can see above, just looking at the unmodified
(wo) results tells you a lot.
pi
More information about the Bogofilter
mailing list