Radical lexers
michael at optusnet.com.au
Wed Dec 10 23:29:08 CET 2003
"Boris 'pi' Piwinger" <3.14 at logic.univie.ac.at> writes:
> [Corrected version]
>
> This is a very short test only. I compare my version (a) of
> the lexer (http://piology.org/bogofilter/lexer_v3.l) with a
> much stricter version of it (b). TOKEN will effectively be
> of the form
> [^[:blank:][:cntrl:]<>;&%@|/\\{}^"*,[\]=()+?:#$._!'`~-]+
[...]
> Here is what I get:
>      wordlist   false neg    false pos
> a)   27060k     210/13612    16/15670
> b)   26832k     206/13612    17/15670
> c)   23332k     210/13612    18/15670
>
> So the size is a surprise. I expected something much smaller
> for b) and even more for c).
This isn't very surprising. You're testing with a small corpus,
on a very easy data set. You're well down in the noise level
on both fp's and fn's.
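To put a rough number on "in the noise", here is a minimal sketch of a pooled two-proportion z-test on the counts quoted above. The function name two_proportion_z is made up for illustration, and the test treats the runs as independent samples, which is only an approximation since all three lexers were scored on the same messages. Take it as a sanity check, not a proper analysis.

```python
import math

def two_proportion_z(err_a, n_a, err_b, n_b):
    """Pooled two-proportion z statistic for two error counts.

    Hypothetical helper for illustration; assumes independent samples,
    which is only roughly true when both runs use the same corpus.
    """
    p_a = err_a / n_a
    p_b = err_b / n_b
    pooled = (err_a + err_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / std_err

# False negatives, lexer a) vs lexer b): 210/13612 vs 206/13612
z_fn = two_proportion_z(210, 13612, 206, 13612)

# False positives, lexer a) vs lexer c): 16/15670 vs 18/15670
z_fp = two_proportion_z(16, 15670, 18, 15670)

# |z| below about 1.96 means no significant difference at the 5% level.
print(f"fn a vs b: z = {z_fn:.2f}")
print(f"fp a vs c: z = {z_fp:.2f}")
```

Both |z| values come out far below 1.96, so on this corpus the three lexers can't be told apart by error rate.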
I'd be curious to see the difference with a tougher dataset
(specifically, a dataset that includes ham addressed to many
different people :)
Michael.