Radical lexers (was: Test with different lexers)

Boris 'pi' Piwinger 3.14 at logic.univie.ac.at
Wed Dec 10 14:51:43 CET 2003


Boris 'pi' Piwinger wrote:

> What I will test next (I don't promise any time soon) is not
> allowing any punctuation at all.

OK, this is only a very short test (I have already spent too much
time on bogofilter over the last two days ;-). I compare my
version of the lexer,
http://piology.org/bogofilter/lexer_v3.l, with a much
stricter version of it. TOKEN will effectively be of the form
[^[:blank:][:cntrl:]<>;&%@|/\\{}^"*,[\]=()+?:#$._!'`~-]+

So it no longer matters where in a token a character
shows up. NO punctuation (I hope I did not miss anything).
So basically letters, digits and characters outside ASCII
are allowed.

And even more extreme: tokens are explicitly [[:alnum:]]+
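
To make the two stricter token definitions concrete, here is a minimal
Python sketch of them. It is not the flex lexer itself: it assumes
[:blank:] means space and tab, [:cntrl:] means 0x00-0x1f plus 0x7f, and
[:alnum:] means ASCII letters and digits (C locale), and it ignores
everything else the real lexer does. The names PUNCT, RADICAL,
MOST_RADICAL, tokens and SAMPLE are made up for this illustration.

```python
import re

# Characters excluded by the stricter TOKEN pattern, copied from the
# bracket expression above (flex's \\ is a backslash, \] a closing bracket).
PUNCT = '<>;&%@|/\\{}^"*,[]=()+?:#$._!\'`~-'

# b) radical lexer: anything except blanks, control characters and PUNCT.
# Assumption: [:blank:] is space and tab, [:cntrl:] is 0x00-0x1f and 0x7f.
RADICAL = re.compile('[^ \\t\\x00-\\x1f\\x7f' + re.escape(PUNCT) + ']+')

# c) most radical lexer: [[:alnum:]]+, assumed here to be ASCII only.
MOST_RADICAL = re.compile('[A-Za-z0-9]+')


def tokens(pattern, text):
    """Return every non-overlapping match of pattern in text."""
    return pattern.findall(text)


# Made-up sample line; \u00fc is the non-ASCII letter u-umlaut.
SAMPLE = "Free!!! v1agra at http://example.com/a_b f\u00fcr 50%"

print(tokens(RADICAL, SAMPLE))
print(tokens(MOST_RADICAL, SAMPLE))
```

Under these assumptions both patterns break the URL into http, example,
com, a and b. They differ only on the non-ASCII word, which b) keeps as
one token and c) splits into "f" and "r".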

Here is what I get:

a) my version of the lexer
   27060k
   fn=210/13612
   fp=16/15670

b) radical lexer
   26832k
   fn=206/13612
   fp=17/15670

c) most radical lexer
   23332k
   fn=210/13612
   fp=17/15670

So the size is a surprise. I expected something much smaller
for b) and smaller still for c).
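
To put the figures in proportion, the following Python sketch turns
them into rates and relative sizes. The numbers are copied from the
three runs above. The names RUNS, FN_TOTAL, FP_TOTAL and summarize are
made up for this illustration, and the reading of the "k" figure as a
size that can be compared across runs is an assumption.

```python
# Figures copied from the three runs above.
RUNS = {
    "a": {"size_k": 27060, "fn": 210, "fp": 16},
    "b": {"size_k": 26832, "fn": 206, "fp": 17},
    "c": {"size_k": 23332, "fn": 210, "fp": 17},
}
FN_TOTAL = 13612  # denominator of the fn figures
FP_TOTAL = 15670  # denominator of the fp figures


def summarize(name):
    """Return error rates and the size relative to run a) for one run."""
    run = RUNS[name]
    base = RUNS["a"]["size_k"]
    return {
        "fn_rate": run["fn"] / FN_TOTAL,
        "fp_rate": run["fp"] / FP_TOTAL,
        "size_vs_a": run["size_k"] / base,
    }


for name in sorted(RUNS):
    s = summarize(name)
    print("%s) fn=%.3f%% fp=%.3f%% size=%.1f%% of a)" % (
        name, 100 * s["fn_rate"], 100 * s["fp_rate"], 100 * s["size_vs_a"]))
```

On these figures b) is less than 1% smaller than a) and c) is about 14%
smaller, while the fn counts differ by at most four messages and the fp
counts by one.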

The result for b) hurts. It suggests (if it can be confirmed)
that we are making the definition of a token much too
complicated. I really did not expect that lexer to work. But
well, that's how it is.

c) is really mind-blowing. This simply MUST NOT work.

pi



