Radical lexers (was: Test with different lexers)
David Relson
relson at osagesoftware.com
Wed Dec 10 15:20:26 CET 2003
On Wed, 10 Dec 2003 14:51:43 +0100
Boris 'pi' Piwinger <3.14 at logic.univie.ac.at> wrote:
> Boris 'pi' Piwinger wrote:
>
> > Next I will test (I don't promise any time too soon) is not
> > allowing any punctuation at all.
>
> OK, this is only a very short test (I have already spent too
> much time on bogofilter over the last two days ;-). I compare
> my version of the lexer,
> http://piology.org/bogofilter/lexer_v3.l, with a much
> stricter version of it. TOKEN is effectively of the form
> [^[:blank:][:cntrl:]<>;&%@|/\\{}^"*,[\]=()+?:#$._!'`~-]+
>
> So it no longer matters where in a token a character shows
> up. NO punctuation (I hope I did not miss anything). So
> basically only letters, digits, and characters outside ASCII
> are allowed.
>
> And, even more extreme, tokens are explicitly: [[:alnum:]]+
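For illustration, the two stricter token definitions above can be approximated in Python regexes (the pattern names and the sample text are mine, not from bogofilter, and `\s` only approximates `[:blank:][:cntrl:]`):

```python
import re

# b) "radical" lexer: any run of characters that is not whitespace or
#    one of the listed punctuation characters; letters, digits, and
#    non-ASCII characters all pass through.
radical = re.compile(r"[^\s<>;&%@|/\\{}^\"*,\[\]=()+?:#$._!'`~-]+")

# c) "most radical" lexer: strictly alphanumeric runs, i.e. [[:alnum:]]+
#    restricted to ASCII here.
most_radical = re.compile(r"[A-Za-z0-9]+")

text = "Get rich-quick!! mailto:spam@example.com café"
print(radical.findall(text))       # 'café' survives as one token
print(most_radical.findall(text))  # 'café' is truncated to 'caf'
```

The visible difference is small: both split on punctuation, but only b) keeps characters outside ASCII inside a token.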
>
> Here is what I get:
>
> a) my version of the lexer
> 27060k
> fn=210/13612
> fp=16/15670
>
> b) radical lexer
> 26832k
> fn=206/13612
> fp=17/15670
>
> c) most radical lexer
> 23332k
> fn=210/13612
> fp=17/15670
As I've often requested, a table like the following is much easier to
read:
    wordlist   false neg   false pos
a)   27060k    210/13612   16/15670
b)   26832k    206/13612   17/15670
c)   23332k    210/13612   17/15670

a - pi's version
b - radical version
c - strictly alphanumeric
With all the sizes in one column, there's no need to scan the sets of
numbers to pick out the one being compared.
> So the size is a surprise. I expected something much smaller
> for b), and even more so for c).
>
> The result for b) hurts. It says (if it can be confirmed)
> that we are making things much too complicated when defining
> a token. I really did not expect that lexer to work. But
> well, that's how it is.
>
> c) is really mind-blowing. This simply MUST NOT work.
It probably means that enough of your tokens are strictly alphanumeric
that the others don't matter.
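A quick, hypothetical way to check that explanation (the sample tokens and names are mine) is to measure what fraction of whitespace-separated tokens in a message is already purely alphanumeric, and would therefore survive a [[:alnum:]]+ lexer unchanged:

```python
# Hypothetical sketch: count how many whitespace-separated tokens are
# already purely alphanumeric, i.e. unaffected by an [[:alnum:]]+ lexer.
tokens = "Click HERE now to win 1000 dollars free offer!!!".split()
alnum = [t for t in tokens if t.isalnum()]
print(f"{len(alnum)}/{len(tokens)} tokens are purely alphanumeric")
```

If that fraction is high across a real corpus, the discriminating tokens are mostly shared by all three lexers, which would explain the nearly identical error counts.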
>
More information about the Bogofilter mailing list