Radical lexers

David Relson relson at osagesoftware.com
Wed Dec 10 16:28:43 CET 2003


On Wed, 10 Dec 2003 16:13:01 +0100
Boris 'pi' Piwinger <3.14 at logic.univie.ac.at> wrote:

> [Corrected version]
> 
> This is a very short test only. I compare my version (a) of
> the lexer (http://piology.org/bogofilter/lexer_v3.l) with a
> much stricter version of it (b). TOKEN will effectively be
> of the form
> [^[:blank:][:cntrl:]<>;&%@|/\\{}^"*,[\]=()+?:#$._!'`~-]+
> 
> So no more difference where in a token a character shows up.
> No punctuation (I hope I did not miss anything). Basically
> letters, digits and characters outside ASCII are allowed.
> 
> And even more extreme (c). Tokens are explicitly: [[:alnum:]]+
> 
> Here is what I get:
>       wordlist  false neg       false pos
> a)    27060k    210/13612       16/15670
> b)    26832k    206/13612       17/15670
> c)    23332k    210/13612       18/15670
> 
> So the size is a surprise. I expected something much smaller
> for b) and even more for c).
> 
> The result for b) hurts. It says (if it can be confirmed)
> that we are doing much too complicated things when defining
> a token. I did really not expect that lexer to work. But
> well, that's how it is.
> 
> c) is really mind-blowing. This simply MUST NOT work.
> 
> pi

As you know, the FAQ exists in French.  Several of the words it uses are
"données", "Problèmes", and "entraîner".  With the [[:alnum:]]+ pattern,
they parse as "donn", "es", "Probl", "mes", "entra", and "ner".  Since
every token in the wordlist has a bunch of database overhead associated
with it, 6 short tokens may well use more space than 3 long ones.
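
To illustrate the splitting, here is a minimal Python sketch.  It is not
the bogofilter lexer: it assumes that [[:alnum:]] behaves as it does in
the C/POSIX locale, i.e. that it matches only ASCII letters and digits.
The per-token overhead of 16 bytes is a made-up figure used purely to
show the arithmetic; the real database overhead may be quite different.

```python
import re

# Assumption: C/POSIX-locale behaviour, where [[:alnum:]] covers only
# ASCII letters and digits, so accented characters end a token.
ASCII_ALNUM = re.compile(r"[A-Za-z0-9]+")


def tokenize(text):
    """Return the runs of ASCII alphanumerics found in text."""
    return ASCII_ALNUM.findall(text)


def storage_cost(tokens, overhead):
    """Sum of token lengths (Latin-1 bytes) plus a fixed cost per token."""
    return sum(len(t.encode("latin-1")) + overhead for t in tokens)


# "données", "Problèmes", "entraîner", written with escapes so the
# example does not depend on the encoding of this message.
words = ["donn\u00e9es", "Probl\u00e8mes", "entra\u00eener"]
fragments = [frag for w in words for frag in tokenize(w)]

# Hypothetical per-token database overhead in bytes (illustration only).
HYPOTHETICAL_OVERHEAD = 16

print(fragments)
# ['donn', 'es', 'Probl', 'mes', 'entra', 'ner']
print(storage_cost(words, HYPOTHETICAL_OVERHEAD),
      storage_cost(fragments, HYPOTHETICAL_OVERHEAD))
```

With any fixed overhead larger than a byte or so per token, the six
fragments cost more than the three whole words, even though the
fragments contain fewer characters in total.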

A question:  have your wordlists been compacted?  Using "bogoutil -d
old.db | bogoutil -l new.db" can reduce the size by 40% or more.

With regard to scoring, it's possible that having a larger number of word
fragments as tokens works as well as a smaller number of complete
words.  You'd have to look in detail at the tokens in the messages.
Unfortunately that's impractical with a large message count.



Consider a french word like "




