Radical lexers
Boris 'pi' Piwinger
3.14 at logic.univie.ac.at
Wed Dec 10 16:13:01 CET 2003
[Corrected version]
This is a very short test only. I compare my version (a) of
the lexer (http://piology.org/bogofilter/lexer_v3.l) with a
much stricter version of it (b). In (b), TOKEN will effectively
be of the form
[^[:blank:][:cntrl:]<>;&%@|/\\{}^"*,[\]=()+?:#$._!'`~-]+
So it no longer matters where in a token a character appears.
No punctuation is allowed (I hope I did not miss anything).
Basically, letters, digits and characters outside ASCII are allowed.
Version (c) is even more extreme. Tokens are explicitly: [[:alnum:]]+
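
To illustrate how (b) and (c) differ on actual text, here is a rough
Python sketch. It is not the flex lexer itself: Python's re module has
no POSIX bracket classes, so [:blank:] and [:cntrl:] are spelled out by
hand, and [[:alnum:]] is assumed to mean ASCII letters and digits only
(C locale). The names PUNCT, BLANK, CNTRL, TOKEN_B, TOKEN_C and the
sample string are made up for this example, and any other rules in the
real lexer are ignored.

```python
import re

# Characters excluded by the (b) pattern, copied from the bracket expression.
PUNCT = "<>;&%@|/\\{}^\"*,[]=()+?:#$._!'`~-"
BLANK = " \t"                                            # [:blank:]
CNTRL = "".join(chr(c) for c in range(0x20)) + "\x7f"    # [:cntrl:] in ASCII

# (b): any run of characters that are not blank, control or listed punctuation.
TOKEN_B = re.compile("[^" + re.escape(BLANK + CNTRL + PUNCT) + "]+")

# (c): [[:alnum:]]+, assumed here to be ASCII-only.
TOKEN_C = re.compile(r"[A-Za-z0-9]+")

sample = "free-money caf\u00e9 100%"
tokens_b = TOKEN_B.findall(sample)
tokens_c = TOKEN_C.findall(sample)
print(tokens_b)
print(tokens_c)
```

On this sample, (b) keeps "café" whole because é lies outside ASCII,
while (c) cuts it off at "caf". Otherwise both split at the same places.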
Here is what I get:
     wordlist   false neg   false pos
a)   27060k     210/13612   16/15670
b)   26832k     206/13612   17/15670
c)   23332k     210/13612   18/15670
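
For easier comparison, the figures can be turned into percentages. The
short Python sketch below does only that arithmetic. It assumes the
denominators 13612 and 15670 are the numbers of spam and non-spam
messages tested; the names results, SPAM_TOTAL, HAM_TOTAL and rates are
made up for this example.

```python
# Figures copied from the table: (wordlist size in k, false neg, false pos).
results = {
    "a": (27060, 210, 16),
    "b": (26832, 206, 17),
    "c": (23332, 210, 18),
}
SPAM_TOTAL = 13612   # denominator of the false-negative column
HAM_TOTAL = 15670    # denominator of the false-positive column

def rates(name):
    """Return (false-negative %, false-positive %, size relative to a)."""
    size, fn, fp = results[name]
    base_size = results["a"][0]
    return (100.0 * fn / SPAM_TOTAL, 100.0 * fp / HAM_TOTAL, size / base_size)

for name in results:
    fn_pct, fp_pct, rel = rates(name)
    print(f"{name}) fn {fn_pct:.2f}%  fp {fp_pct:.3f}%  size {rel:.1%} of a)")
```

The false-negative rate stays near 1.5% for all three lexers, and the
wordlist for c) is still about 86% of the size of a).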
The size is a surprise: I expected a much smaller wordlist
for b), and even more so for c).
The result for b) hurts. It suggests (if it can be confirmed)
that our token definitions are far more complicated than they
need to be. I really did not expect that lexer to work. But
well, that's how it is.
c) is really mind-blowing. This simply MUST NOT work.
pi