Radical lexers (was: Test with different lexers)

David Relson relson at osagesoftware.com
Wed Dec 10 15:20:26 CET 2003


On Wed, 10 Dec 2003 14:51:43 +0100
Boris 'pi' Piwinger <3.14 at logic.univie.ac.at> wrote:

> Boris 'pi' Piwinger wrote:
> 
> > The next thing I will test (I don't promise it any time soon) is
> > not allowing any punctuation at all.
> 
> OK, this is only a very short test (I have already spent too much
> time on bogofilter over the last two days ;-). I compare my
> version of the lexer
> http://piology.org/bogofilter/lexer_v3.l with a much
> stricter version of it. TOKEN will effectively be of the form
> [^[:blank:][:cntrl:]<>;&%@|/\\{}^"*,[\]=()+?:#$._!'`~-]+
> 
> So it no longer matters where in a token a character
> shows up. NO punctuation (I hope I did not miss anything).
> So basically only letters, digits and characters outside ASCII
> are allowed.
> 
> And even more extreme: tokens are explicitly [[:alnum:]]+
> 
> Here is what I get:
> 
> a) my version of the lexer
>    27060k
>    fn=210/13612
>    fp=16/15670
> 
> b) radical lexer
>    26832k
>    fn=206/13612
>    fp=17/15670
> 
> c) most radical lexer
>    23332k
>    fn=210/13612
>    fp=17/15670

As I've often requested, a table like the following is much easier to
read:

      wordlist  false neg       false pos
a)    27060k    210/13612       16/15670   
b)    26832k    206/13612       17/15670   
c)    23332k    210/13612       17/15670   

a - pi's version
b - radical version
c - strictly alphanumeric

With all the sizes in one column, there's no need to scan the sets of
numbers to pick out the one being compared.
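
For comparing error rates rather than raw counts, the same numbers can be turned into percentages. The following is only a small Python sketch; the dictionary layout and the names `RESULTS`, `rate_percent` and `summarize` are made up for illustration and are not part of bogofilter.

```python
# Counts copied from the table above; the dictionary layout and
# function names are illustrative, not part of bogofilter.
RESULTS = {
    "a": {"wordlist_k": 27060, "fn": (210, 13612), "fp": (16, 15670)},
    "b": {"wordlist_k": 26832, "fn": (206, 13612), "fp": (17, 15670)},
    "c": {"wordlist_k": 23332, "fn": (210, 13612), "fp": (17, 15670)},
}


def rate_percent(count, total):
    """Return count/total as a percentage."""
    return 100.0 * count / total


def summarize(results):
    """Build one row per lexer: (label, wordlist size, fn %, fp %)."""
    rows = []
    for label, r in sorted(results.items()):
        rows.append((label, r["wordlist_k"],
                     rate_percent(*r["fn"]), rate_percent(*r["fp"])))
    return rows


if __name__ == "__main__":
    print("      wordlist  fn %     fp %")
    for label, size, fn, fp in summarize(RESULTS):
        print(f"{label})    {size}k    {fn:.3f}    {fp:.3f}")
```

Across the three lexers the counts differ by at most four false negatives and one false positive.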

> So the size is a surprise. I expected something much smaller
> for b) and even more for c).
> 
> The result for b) hurts. It says (if it can be confirmed)
> that we are doing far too complicated things when defining
> a token. I really did not expect that lexer to work. But
> well, that's how it is.
> 
> c) is really mind-blowing. This simply MUST NOT work.

It probably means that enough of your tokens are strictly alphanumeric
that the others don't matter.
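
To see how lexers b) and c) differ on a given string, here is a rough Python approximation of the two token patterns. It is not bogofilter's flex lexer. It assumes that [:blank:] means space and tab, that [:cntrl:] means 0x00-0x1f plus 0x7f, and that [[:alnum:]] means ASCII letters and digits (C locale). The names `RADICAL`, `ALNUM`, `radical_tokens` and `alnum_tokens` are invented for this sketch.

```python
import re

# Punctuation excluded by the "radical" TOKEN pattern (b).
EXCLUDED_PUNCT = "<>;&%@|/\\{}^\"*,[]=()+?:#$._!'`~-"

# Assumption: [:blank:] is space/tab, [:cntrl:] is 0x00-0x1f plus 0x7f.
RADICAL = re.compile(
    "[^ \\t\\x00-\\x1f\\x7f" + re.escape(EXCLUDED_PUNCT) + "]+"
)

# Assumption: [[:alnum:]] in the C locale, i.e. ASCII letters and digits.
ALNUM = re.compile("[A-Za-z0-9]+")


def radical_tokens(text):
    """Tokens under (b): runs of anything except blanks, controls, punctuation."""
    return RADICAL.findall(text)


def alnum_tokens(text):
    """Tokens under (c): runs of ASCII letters and digits only."""
    return ALNUM.findall(text)


if __name__ == "__main__":
    sample = "Müller's cheap-pills $9.99"
    print(radical_tokens(sample))
    print(alnum_tokens(sample))
```

On pure ASCII text the two functions return the same tokens. They only diverge on characters outside ASCII, which b) keeps inside a token and c) treats as a separator.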
