Radical lexers
David Relson
relson at osagesoftware.com
Wed Dec 10 16:28:43 CET 2003
On Wed, 10 Dec 2003 16:13:01 +0100
Boris 'pi' Piwinger <3.14 at logic.univie.ac.at> wrote:
> [Corrected version]
>
> This is a very short test only. I compare my version (a) of
> the lexer (http://piology.org/bogofilter/lexer_v3.l) with a
> much stricter version of it (b). TOKEN will effectively be
> of the form
> [^[:blank:][:cntrl:]<>;&%@|/\\{}^"*,[\]=()+?:#$._!'`~-]+
>
> So no more difference where in a token a character shows up.
> No punctuation (I hope I did not miss anything). Basically
> letters, digits and characters outside ASCII are allowed.
>
> And even more extreme (c). Tokens are explicitly: [[:alnum:]]+
>
> Here is what I get:
>     wordlist   false neg   false pos
> a)  27060k     210/13612   16/15670
> b)  26832k     206/13612   17/15670
> c)  23332k     210/13612   18/15670
>
> So the size is a surprise. I expected something much smaller
> for b) and even more for c).
>
> The result for b) hurts. It says (if it can be confirmed)
> that we are doing much too complicated things when defining
> a token. I really did not expect that lexer to work. But
> well, that's how it is.
>
> c) is really mind-blowing. This simply MUST NOT work.
>
> pi

As you know, the FAQ exists in French. Several of the words it uses are
"données", "Problèmes", and "entraîner". With the [[:alnum:]]+ pattern,
they parse as "donn", "es", "Probl", "mes", "entra", and "ner". Since
every token in the wordlist has a bunch of database overhead associated
with it, six short tokens may well use more space than three long ones.
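
The split is easy to reproduce with a small Python sketch. It assumes
the C locale, where [[:alnum:]] covers only ASCII letters and digits,
and uses the regular expression [A-Za-z0-9]+ as a stand-in for the flex
pattern. The names ASCII_ALNUM and tokenize are made up for this
illustration; none of this is bogofilter code.

```python
import re

# Assumption: in the C locale, [[:alnum:]] matches only ASCII letters
# and digits, so this character class stands in for the flex pattern.
ASCII_ALNUM = re.compile(r"[A-Za-z0-9]+")


def tokenize(text):
    """Return every maximal run of ASCII alphanumerics in text."""
    return ASCII_ALNUM.findall(text)


# The three French words from the FAQ; the accented letters are written
# as escapes so the example does not depend on the file's encoding.
words = ["donn\u00e9es", "Probl\u00e8mes", "entra\u00eener"]

tokens = [tok for word in words for tok in tokenize(word)]
print(tokens)
# ['donn', 'es', 'Probl', 'mes', 'entra', 'ner']
```

Three words go in and six tokens come out, each of which would need its
own wordlist entry.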

A question: have your wordlists been compacted? Using "bogoutil -d
old.db | bogoutil -l new.db" can reduce the size by 40% or more.

With regard to scoring, it's possible that having a larger number of
word fragments as tokens works as well as a smaller number of complete
words. You'd have to look in detail at the tokens in the messages.
Unfortunately, that's impractical with a large message count.

Consider a french word like "