lexer change
Boris 'pi' Piwinger
3.14 at logic.univie.ac.at
Tue Nov 11 11:55:06 CET 2003
Boris 'pi' Piwinger wrote:
> Method 3 training: 2.8M
Recall that this run allowed two-byte tokens.
>                    spam  good
> .MSG_COUNT          630   284
> wo (fn): 0.500000    24    22    22    68
> wo (fp): 0.500000     4     4     3    11
> wi (fn): 0.544564    40    30    31   101
> wi (fp): 0.544564     3     1     2     6
> wi (fn): 0.499999    24    22    21    67
> wi (fp): 0.499999     5     4     4    13
> wi (fn): 0.419627     8    12    15    35
> wi (fp): 0.419627    12     8    11    31
> So we see that method 3 (allowing two-byte-tokens) is most
> useful, mainly because it helps identifying ham better.
> Looking at number tokens does not change much, it even
> performs a bit worse.
I have now run another check, this time allowing one-byte tokens:
TOKEN {TOKENFRONT}({TOKENMID}{TOKENBACK})?
Only 2.6M.
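For anyone who wants to see what the optional group changes without rebuilding the lexer, here is a rough Python sketch. The three character classes are made-up stand-ins (the real TOKENFRONT, TOKENMID and TOKENBACK definitions are not quoted in this mail), and the two-byte form is only my assumption of what the rule looks like without the optional group.

```python
import re

# Hypothetical stand-ins for the flex definitions; the real
# TOKENFRONT/TOKENMID/TOKENBACK classes are not shown in this mail.
TOKENFRONT = r"[A-Za-z0-9]"
TOKENMID = r"[A-Za-z0-9.'-]*"
TOKENBACK = r"[A-Za-z0-9]"

# Assumed two-byte form: front and back are both required.
TWO_BYTE = re.compile(f"{TOKENFRONT}{TOKENMID}{TOKENBACK}")
# One-byte form: the mid+back tail is optional, mirroring the rule above.
ONE_BYTE = re.compile(f"{TOKENFRONT}(?:{TOKENMID}{TOKENBACK})?")


def tokens(pattern, text):
    """Return every non-overlapping match of pattern in text."""
    return [m.group(0) for m in pattern.finditer(text)]


if __name__ == "__main__":
    sample = "a bc def"
    print(tokens(TWO_BYTE, sample))  # single letters are skipped
    print(tokens(ONE_BYTE, sample))  # single letters become tokens
```

With these stand-in classes the only difference between the two patterns is whether a lone character is emitted as a token.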
                   spam  good
.MSG_COUNT          561   296
wo (fn): 0.500000    20    18    15    53
wo (fp): 0.500000     6     7     3    16
wi (fn): 0.578271    39    28    32    99
wi (fp): 0.578271     3     1     2     6
wi (fn): 0.503642    21    18    18    57
wi (fp): 0.503642     5     6     3    14
wi (fn): 0.439392     7    10    12    29
wi (fp): 0.439392    11    11     9    31
Very surprisingly, it increases the number of false positives
at the fixed 0.5 cutoff (16 instead of 11), but with the cutoff
adjusted to the different targets it performs better. So what
do we do with this result?
So let's see if removing the dollar rule helps:
That gives 2.7M, still pretty small.
                   spam  good
.MSG_COUNT          598   304
wo (fn): 0.500000    24    14    16    54
wo (fp): 0.500000     6     5     2    13
wi (fn): 0.598652    50    38    36   124
wi (fp): 0.598652     3     2     1     6
wi (fn): 0.499987    23    14    15    52
wi (fp): 0.499987     6     6     2    14
wi (fn): 0.450249    12    12    13    37
wi (fp): 0.450249    13    12     6    31
Now that gets us back in business: the fps go down at the
fixed cutoff. Again, the different targets give contradictory
results. Very confusing.
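To make the contradiction easier to see, here is a small Python sketch that ranks the three variants using only the totals (last column) from the tables in this thread. The variant labels and setting names are my own, and the comparison assumes the three runs were scored on comparable test sets, which the tables do not confirm.

```python
# Totals (fn, fp) copied from the last column of the three tables.
# Variant and setting names are ad hoc labels, not bogofilter terms.
results = {
    "two-byte tokens": {
        "cutoff 0.5": (68, 11),
        "fp total 6": (101, 6),
        "fp total 31": (35, 31),
    },
    "one-byte tokens": {
        "cutoff 0.5": (53, 16),
        "fp total 6": (99, 6),
        "fp total 31": (29, 31),
    },
    "dollar rule removed": {
        "cutoff 0.5": (54, 13),
        "fp total 6": (124, 6),
        "fp total 31": (37, 31),
    },
}


def ranking(setting, column):
    """Variants ordered from fewest to most errors; column 0 = fn, 1 = fp."""
    return sorted(results, key=lambda name: results[name][setting][column])


if __name__ == "__main__":
    print("fp at cutoff 0.5:", ranking("cutoff 0.5", 1))
    for setting in ("cutoff 0.5", "fp total 6", "fp total 31"):
        print("fn at", setting + ":", ranking(setting, 0))
```

Ranked this way, the fixed cutoff and the matched fp totals do not agree on the order of the variants, which is the confusion described above.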
So, to be on the safe side: two-byte tokens are clearly
beneficial. The rest is unclear.
pi