Radical lexers

Wed Dec 10 23:29:08 CET 2003

"Boris 'pi' Piwinger" <3.14 at logic.univie.ac.at> writes:
> [Corrected version]
> 
> This is a very short test only. I compare my version (a) of
> the lexer (http://piology.org/bogofilter/lexer_v3.l) with a
> much stricter version of it (b). TOKEN will effectively be
> of the form
> [^[:blank:][:cntrl:]<>;&%@|/\\{}^"*,[\]=()+?:#$._!'`~-]+
[...] 
> Here is what I get:
>       wordlist  false neg       false pos
> a)    27060k    210/13612       16/15670
> b)    26832k    206/13612       17/15670
> c)    23332k    210/13612       18/15670
> 
> So the size is a surprise. I expected something much smaller
> for b) and even more for c).

This isn't super surprising. You're testing with a small corpus,
on a very easy data set. You're well down in the noise level
on both fp's and fn's.
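
To put rough numbers on "noise level", here is a minimal Python sketch
that takes the counts from the table above and computes each error rate
with its binomial standard error, plus a pooled two-proportion z
statistic between variants. The helper names (rate_and_stderr, z_score)
are made up for illustration and are not part of bogofilter. The z test
also assumes independent samples, which is not strictly true here since
all three lexers were run on the same messages, so treat the result as
an order-of-magnitude check only.

```python
import math

# Error counts quoted above: (errors, messages) per lexer variant.
false_neg = {"a": (210, 13612), "b": (206, 13612), "c": (210, 13612)}
false_pos = {"a": (16, 15670), "b": (17, 15670), "c": (18, 15670)}


def rate_and_stderr(errors, total):
    """Return the error rate and its binomial standard error."""
    p = errors / total
    return p, math.sqrt(p * (1 - p) / total)


def z_score(x, y):
    """Two-proportion z statistic for two (errors, total) pairs.

    Uses the pooled variance and assumes independent samples, which is
    only an approximation for lexers tested on the same corpus.
    """
    (e1, n1), (e2, n2) = x, y
    pooled = (e1 + e2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (e1 / n1 - e2 / n2) / se


for name, table in (("false neg", false_neg), ("false pos", false_pos)):
    for variant, (errors, total) in table.items():
        p, se = rate_and_stderr(errors, total)
        print(f"{name} {variant}) {100 * p:.3f}% +/- {100 * se:.3f}%")
    for other in ("b", "c"):
        z = z_score(table["a"], table[other])
        print(f"{name} a vs {other}: z = {z:.2f}")
```

With these counts every |z| stays below 1, far from the value of
about 2 usually wanted before calling a difference real.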

I'd be curious to see the difference with a tougher dataset
(specifically, a dataset that includes hams to many
people :)

Michael.