lexer change

Tue Nov 11 11:55:06 CET 2003

Boris 'pi' Piwinger wrote:

> Method 3 training: 2.8M

Recall that this was the run allowing two-byte tokens.

>                        spam   good
> .MSG_COUNT              630    284
> wo (fn):  0.500000    24     22     22     68
> wo (fp):  0.500000     4      4      3     11
> wi (fn):  0.544564    40     30     31    101
> wi (fp):  0.544564     3      1      2      6
> wi (fn):  0.499999    24     22     21     67
> wi (fp):  0.499999     5      4      4     13
> wi (fn):  0.419627     8     12     15     35
> wi (fp):  0.419627    12      8     11     31

> So we see that method 3 (allowing two-byte tokens) is most
> useful, mainly because it helps identify ham better.
> Looking at number tokens does not change much; it even
> performs a bit worse.

I have now run another check, this time allowing one-byte tokens:
TOKEN           {TOKENFRONT}({TOKENMID}{TOKENBACK})?
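
As a rough illustration of what the optional group does, here is a
minimal Python sketch. The three character classes are hypothetical
stand-ins (the real TOKENFRONT, TOKENMID and TOKENBACK definitions are
not shown in this message), TOKENMID is assumed to match zero or more
characters, and the shape of the previous two-byte rule is likewise an
assumption. Python's re module only approximates flex matching, but
for patterns this simple the result should be the same.

```python
import re

# Hypothetical stand-ins for the lexer's character classes; the real
# definitions are not shown in this message.
TOKENFRONT = r"[A-Za-z]"
TOKENMID = r"[A-Za-z0-9.'-]*"   # assumed to match zero or more characters
TOKENBACK = r"[A-Za-z0-9]"

# Assumed shape of the previous rule: front and back are both required,
# so the shortest possible token has two characters.
TWO_BYTE = re.compile(f"{TOKENFRONT}{TOKENMID}{TOKENBACK}")

# The rule from this message: the mid/back part is optional, so a
# single TOKENFRONT character is a token by itself.
ONE_BYTE = re.compile(f"{TOKENFRONT}(?:{TOKENMID}{TOKENBACK})?")


def tokens(pattern, text):
    """Return every non-overlapping match of pattern in text."""
    return [m.group(0) for m in pattern.finditer(text)]


sample = "a bc free-offer x"
print(tokens(TWO_BYTE, sample))  # ['bc', 'free-offer']
print(tokens(ONE_BYTE, sample))  # ['a', 'bc', 'free-offer', 'x']
```

With the optional group, the single letters "a" and "x" become tokens
of their own; everything else is tokenized as before.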

Only 2.6M.
                       spam   good
.MSG_COUNT              561    296
wo (fn):  0.500000    20     18     15     53
wo (fp):  0.500000     6      7      3     16
wi (fn):  0.578271    39     28     32     99
wi (fp):  0.578271     3      1      2      6
wi (fn):  0.503642    21     18     18     57
wi (fp):  0.503642     5      6      3     14
wi (fn):  0.439392     7     10     12     29
wi (fp):  0.439392    11     11      9     31

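Very surprisingly, this increases the number of false
positives at the 0.5 cutoff (16 instead of 11), but when
adjusted to the different targets it performs better (fewer
false negatives at the same false-positive totals). So what
do we do with this result?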

So let's see if removing the dollar rule helps:
Now again 2.7M, still pretty small.
                       spam   good
.MSG_COUNT              598    304
wo (fn):  0.500000    24     14     16     54
wo (fp):  0.500000     6      5      2     13
wi (fn):  0.598652    50     38     36    124
wi (fp):  0.598652     3      2      1      6
wi (fn):  0.499987    23     14     15     52
wi (fp):  0.499987     6      6      2     14
wi (fn):  0.450249    12     12     13     37
wi (fp):  0.450249    13     12      6     31

Now that gets us back in business: the false positives at the
0.5 cutoff go down (13 instead of 16). Again, the different
targets give contradictory results. Very confusing.
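
To see the contradiction at a glance, the totals from the three tables
can be lined up side by side. This is only a small sketch: the numbers
are the last column of each table (which is the sum of the three
preceding columns), while the variant names and setting labels are
made up here for readability and are not bogofilter terms.

```python
# (false negatives, false positives) totals, copied from the last
# column of the three tables in this message.
# "cutoff 0.5" is the "wo" row pair; "fp = 6" and "fp = 31" are the
# "wi" row pairs whose false-positive total is 6 and 31 respectively.
# Variant names and setting labels are illustrative only.
TOTALS = {
    "two-byte": {
        "cutoff 0.5": (68, 11), "fp = 6": (101, 6), "fp = 31": (35, 31),
    },
    "one-byte": {
        "cutoff 0.5": (53, 16), "fp = 6": (99, 6), "fp = 31": (29, 31),
    },
    "one-byte, no dollar rule": {
        "cutoff 0.5": (54, 13), "fp = 6": (124, 6), "fp = 31": (37, 31),
    },
}


def fewest(setting, index):
    """Return the variant with the smallest count at the given setting.

    index 0 selects false negatives, index 1 selects false positives.
    """
    return min(TOTALS, key=lambda name: TOTALS[name][setting][index])


for setting in ("cutoff 0.5", "fp = 6", "fp = 31"):
    for name, by_setting in TOTALS.items():
        fn, fp = by_setting[setting]
        print(f"{setting:<11} {name:<25} fn={fn:>3} fp={fp:>2}")

print(fewest("cutoff 0.5", 1))  # two-byte
print(fewest("fp = 6", 0))      # one-byte
print(fewest("fp = 31", 0))     # one-byte
```

At the 0.5 cutoff the two-byte run has the fewest false positives,
while at both adjusted targets the one-byte run with the dollar rule
still in place has the fewest false negatives, so no variant wins
everywhere.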

So, to be on the safe side: two-byte tokens are clearly
beneficial. The rest is unclear.

pi