Radical lexers

Boris 'pi' Piwinger 3.14 at logic.univie.ac.at
Wed Dec 10 16:13:01 CET 2003


[Corrected version]

This is only a very short test. I compare my version (a) of
the lexer (http://piology.org/bogofilter/lexer_v3.l) with a
much stricter version of it (b). In (b), TOKEN will effectively
be of the form
[^[:blank:][:cntrl:]<>;&%@|/\\{}^"*,[\]=()+?:#$._!'`~-]+

So it no longer matters where in a token a character shows up.
No punctuation is allowed (I hope I did not miss anything).
Basically, only letters, digits and characters outside ASCII
are allowed.
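
To make the effect of (b) concrete, here is a rough Python sketch of
the same character class. It is not the flex source: the names
TOKEN_B and tokenize_b are made up for illustration, [:blank:] is
assumed to mean space and tab, and [:cntrl:] is assumed to mean
0x00-0x1f plus 0x7f (the C locale reading).

```python
import re

# Python approximation of the flex class used for (b).
# Assumptions: [:blank:] = space and tab, [:cntrl:] = 0x00-0x1f and 0x7f.
# Everything else (letters, digits, non-ASCII characters) may form a token.
TOKEN_B = re.compile(
    r"""[^ \t\x00-\x1f\x7f<>;&%@|/\\{}^"*,\[\]=()+?:#$._!'`~-]+"""
)


def tokenize_b(text):
    """Return all maximal runs of allowed characters, in order."""
    return TOKEN_B.findall(text)


if __name__ == "__main__":
    sample = "Buy now!!! http://example.com/cheap-pills, 50% off"
    print(tokenize_b(sample))
```

With this class a URL or a hyphenated word is simply split at every
punctuation character, so no token carries any structure beyond a run
of letters, digits and non-ASCII characters.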

And even more extreme (c): tokens are explicitly [[:alnum:]]+
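
A comparable Python sketch of (c), assuming [[:alnum:]] is read in
the C locale, i.e. ASCII letters and digits only. The names TOKEN_C
and tokenize_c are hypothetical, not taken from the lexer.

```python
import re

# Python approximation of (c): [[:alnum:]]+ in the C locale.
# Assumption: only ASCII letters and digits count as alphanumeric,
# so non-ASCII characters act as token separators here.
TOKEN_C = re.compile(r"[A-Za-z0-9]+")


def tokenize_c(text):
    """Return all maximal runs of ASCII letters and digits, in order."""
    return TOKEN_C.findall(text)


if __name__ == "__main__":
    print(tokenize_c("naïve Café, re-sent 2003"))
```

Under that assumption, words containing non-ASCII characters are cut
into pieces, which is the main way (c) differs from the stricter
punctuation-free class.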

Here is what I get:
      wordlist  false neg       false pos
a)    27060k    210/13612       16/15670
b)    26832k    206/13612       17/15670
c)    23332k    210/13612       18/15670
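
To put the counts on a common scale, the fractions can be turned into
rates. A small sketch with the numbers copied from the table; reading
the denominators as the number of spam and non-spam messages tested
is an assumption, and RESULTS and rates are illustrative names.

```python
# Counts copied from the table: (false negatives, spam total,
# false positives, non-spam total). Interpreting the denominators
# as message totals is an assumption.
RESULTS = {
    "a": (210, 13612, 16, 15670),
    "b": (206, 13612, 17, 15670),
    "c": (210, 13612, 18, 15670),
}


def rates(false_neg, n_spam, false_pos, n_ham):
    """Return (false negative rate, false positive rate) as fractions."""
    return false_neg / n_spam, false_pos / n_ham


if __name__ == "__main__":
    for name, counts in RESULTS.items():
        fn_rate, fp_rate = rates(*counts)
        print(f"{name}) false neg {fn_rate:.2%}  false pos {fp_rate:.2%}")
```

With these counts all three lexers land at a false negative rate of
roughly 1.5% and a false positive rate of roughly 0.1%.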

So the size is a surprise. I expected something much smaller
for b) and even more so for c). Instead, b) is less than 1%
smaller than a), and c) only about 14% smaller.

The result for b) hurts. It says (if it can be confirmed)
that we are making things far too complicated when defining
a token. I really did not expect that lexer to work. But
well, that's how it is.

c) is really mind-blowing. This simply MUST NOT work.

pi



