lexer change

Mon Nov 10 17:17:41 CET 2003

Hi!

Here is one more test about the two-byte-token/numerical patch.

Method 1 is with this patch, method 2 is unmodified 0.15.8,
method 3 is two-byte tokens only, method 4 is numeric tokens only.

Corpus sizes:

            t     r0     r1     r2    tot
sp      11679   3896   3892   3891  11679
ns      12722   4244   4239   4240  12723

Method 1 training: 2.9M
                       spam   good
.MSG_COUNT              586    318
wo (fn):  0.500000    22     18     24     64
wo (fp):  0.500000     5      4      6     15
wi (fn):  0.612968    53     35     41    129
wi (fp):  0.612968     2      2      2      6
wi (fn):  0.500248    22     18     24     64
wi (fp):  0.500248     5      4      5     14
wi (fn):  0.471041    14     15     12     41
wi (fp):  0.471041    13     10      9     32

Method 2 training: 2.7M
                       spam   good
.MSG_COUNT              592    307
wo (fn):  0.500000    26     23     19     68
wo (fp):  0.500000     5      4      4     13
wi (fn):  0.581092    50     41     41    132
wi (fp):  0.581092     3      2      1      6
wi (fn):  0.499993    26     23     19     68
wi (fp):  0.499993     6      4      5     15
wi (fn):  0.457261    15     15     14     44
wi (fp):  0.457261    14     10      8     32

Method 3 training: 2.8M
                       spam   good
.MSG_COUNT              630    284
wo (fn):  0.500000    24     22     22     68
wo (fp):  0.500000     4      4      3     11
wi (fn):  0.544564    40     30     31    101
wi (fp):  0.544564     3      1      2      6
wi (fn):  0.499999    24     22     21     67
wi (fp):  0.499999     5      4      4     13
wi (fn):  0.419627     8     12     15     35
wi (fp):  0.419627    12      8     11     31

Method 4 training: 2.8M
                       spam   good
.MSG_COUNT              584    312
wo (fn):  0.500000    26     24     24     74
wo (fp):  0.500000     4      5      4     13
wi (fn):  0.629161    57     49     42    148
wi (fp):  0.629161     3      1      2      6
wi (fn):  0.499967    25     24     24     73
wi (fp):  0.499967     4      6      4     14
wi (fn):  0.450762    13     19     12     44
wi (fp):  0.450762    12     11      9     32

So we see that method 3 (allowing two-byte tokens) is the most
useful, mainly because it helps to identify ham better.
Looking at number tokens does not change much; it even
performs a bit worse.

I made another interesting observation. The lexer takes
care of dollar prices of the form "$[0-9]+(\.[0-9]+)?".
Well, OK, but then why not "(€|EUR) [0-9]+([.,][0-9]+)?" as well?
Since there are many currencies out there, adding a rule for
each of them would not make much sense. So my question is rather
whether we need the dollar case at all (it allows neither 123.45$
nor $100,000). And if we need it, why only this special case of a
price?
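
To make the coverage of the two patterns concrete, here is a small Python sketch. It is not the lexer code. It only feeds the two expressions quoted above to Python's re module, with the "$" escaped because a bare "$" is an anchor there. The helper first_match and the sample strings are made up for illustration.

```python
import re

# Dollar pattern as quoted in the post. "$" is escaped because Python's re
# treats a bare "$" as an end-of-string anchor (assumption: the lexer rule
# means a literal dollar sign).
DOLLAR = re.compile(r"\$[0-9]+(\.[0-9]+)?")
# Euro variant as suggested in the post.
EURO = re.compile(r"(€|EUR) [0-9]+([.,][0-9]+)?")

def first_match(pattern, text):
    """Return the first substring the pattern matches, or None."""
    m = pattern.search(text)
    return m.group(0) if m else None

# Hypothetical sample strings, not taken from the corpus.
samples = ["$19.95", "123.45$", "$100,000", "EUR 19,95", "€ 5"]
for s in samples:
    print(repr(s), first_match(DOLLAR, s), first_match(EURO, s))
```

With these patterns "123.45$" gives no dollar match at all, and "$100,000" is only matched up to "$100", which is the gap mentioned above.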

So while we are at it let's also test this (removal of that
rule):

Method 5 training: 2.7M
                       spam   good
.MSG_COUNT              593    309
wo (fn):  0.500000    30     30     20     80
wo (fp):  0.500000     4      5      3     12
wi (fn):  0.546680    41     36     34    111
wi (fp):  0.546680     2      2      2      6
wi (fn):  0.499780    29     30     19     78
wi (fp):  0.499780     5      6      4     15
wi (fn):  0.457308    16     18     13     47
wi (fp):  0.457308    14     10      8     32

So not looking at those prices is actually a good way to
avoid false positives at a very small cost. In fact this
performs better than the unmodified release (method 2) when
the number of fp is small! In other words: the $-rule is as
good or bad as the other rules above.

General remarks:

As you can see, I tried different false-positive targets
(none, 6, 14, 32). Note that they are often missed (as
expected). So you always have to look at the false positives,
which is also a very important plausibility check.

Further note that those targets often lead to results nobody
would use: you sometimes increase false positives by a factor
of two or three and remove only very few false negatives in return.
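
For readers who wonder why a false-positive target can be missed, here is a minimal Python sketch of one plausible procedure. It is not the test script used for the tables above. It assumes that a message counts as spam when its score is at or above the cutoff, that the cutoff is chosen on one set of ham scores, and that it is then applied to a different set. All scores and both function names are invented for illustration.

```python
def count_errors(spam_scores, ham_scores, cutoff):
    """Return (fn, fp) when a message is called spam if score >= cutoff."""
    fn = sum(1 for s in spam_scores if s < cutoff)
    fp = sum(1 for s in ham_scores if s >= cutoff)
    return fn, fp

def cutoff_for_fp_target(ham_scores, target_fp):
    """Lowest cutoff among the observed scores with at most target_fp fp."""
    # float("inf") is a sentinel: it always yields zero false positives.
    candidates = sorted(set(ham_scores)) + [float("inf")]
    for c in candidates:
        if count_errors([], ham_scores, c)[1] <= target_fp:
            return c

# Hypothetical scores: the cutoff is tuned on one set, then applied to another.
tune_ham = [0.1, 0.2, 0.3, 0.6, 0.7, 0.9]
test_ham = [0.15, 0.4, 0.72, 0.8, 0.95]
test_spam = [0.65, 0.9, 0.99, 0.5]

cutoff = cutoff_for_fp_target(tune_ham, target_fp=2)
print(cutoff, count_errors(test_spam, test_ham, cutoff))
```

In this toy data the cutoff tuned for 2 false positives gives 3 on the other set, so the target is missed in the same way as in the wi rows above.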

But as you can see above, just looking at the unmodified
(wo) results tells you a lot.
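
As a compact summary, the following Python sketch lists the wo totals of the five methods relative to method 2 (the unmodified release). The numbers are copied from the "tot" column of the wo rows above. The variable names are mine.

```python
# (fn, fp) totals of the "wo" rows, copied from the tables above.
wo_totals = {
    1: (64, 15),  # full patch
    2: (68, 13),  # unmodified 0.15.8
    3: (68, 11),  # two-byte tokens only
    4: (74, 13),  # numeric tokens only
    5: (80, 12),  # dollar rule removed
}

base_fn, base_fp = wo_totals[2]
# Difference of each method to the unmodified release.
deltas = {m: (fn - base_fn, fp - base_fp) for m, (fn, fp) in wo_totals.items()}

for m in sorted(deltas):
    d_fn, d_fp = deltas[m]
    print(f"method {m}: fn {d_fn:+d}, fp {d_fp:+d}")
```

Method 3 is the only one that lowers fp without raising fn.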

pi