Radical lexers
David Relson
relson at osagesoftware.com
Wed Dec 10 17:02:07 CET 2003
On Wed, 10 Dec 2003 16:35:40 +0100
Boris 'pi' Piwinger <3.14 at logic.univie.ac.at> wrote:
> David Relson wrote:
>
> > As you know, the FAQ exists in french. Several of the words it uses
> > are "données", "Problèmes", and "entraîner". With the [[:alnum:]]+
> > pattern, they parse as "donn", "es", "Probl", "mes", "entra", and
> > "ner". Since every token in the wordlist has a bunch of database
> > overhead associated with it, 6 short tokens may well use more space
> > than 3 long ones.
>
> That could explain size effects. Good.
>
> > A question: have your wordlists been compacted? Using "bogoutil -d
> > old.db | bogoutil -l new.db" can reduce the size by 40% or more.
>
> No, those lists were just built with -s and -n.
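The accented-word splitting quoted above can be sketched in Python (a rough emulation, assuming [[:alnum:]]+ matches only ASCII alphanumerics, as in the C locale; bogofilter's actual lexer is flex-generated, so this is illustrative only):

```python
import re

# Rough stand-in for [[:alnum:]]+ in the C locale, where accented
# bytes such as 'é' are not alphanumeric and therefore split words.
def tokenize(text):
    return re.findall(r'[A-Za-z0-9]+', text)

words = ["données", "Problèmes", "entraîner"]
fragments = [frag for w in words for frag in tokenize(w)]
print(fragments)  # ['donn', 'es', 'Probl', 'mes', 'entra', 'ner']
```

Each accented word becomes two short fragments, which is where the extra tokens (and their per-token database overhead) come from.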
Dump/load produces the smallest database. Since dump writes the tokens
in sorted order, the load process can simply write the database
sequentially. When building with -s and -n, tokens arrive in random
order, so data blocks get split, which wastes space. (c) has the
largest number of tokens, so probably has the largest number of split
blocks.
> > With regards to scoring, it's possible that having a larger number
> > of word fragments as tokens works as well as a smaller number of
> > complete words.
>
> Could be. Maybe we never noticed it since tokens of length
> one and two are not allowed.
At a rough count, there are fewer than 200 single character tokens (256
characters, minus 32 control characters and 25 or so special symbols).
Have you ever looked at their spam/ham counts? Are any of them
significantly hammish or spammish? Running 'bogoutil -d wordlist.db |
grep "^. "' would list them all.
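The same check can be sketched in Python. The dump format assumed here ("token spam_count ham_count" per line) is a guess for illustration; check your bogoutil version's actual output before relying on it:

```python
def single_char_tokens(dump_text):
    """Return (token, spam, ham) triples for one-character tokens
    found in a bogoutil-style dump (format assumed, see above)."""
    hits = []
    for line in dump_text.splitlines():
        token, spam, ham = line.split()[:3]
        if len(token) == 1:
            hits.append((token, int(spam), int(ham)))
    return hits

# Hypothetical dump fragment; real counts come from your wordlist.
dump = "the 120 340\na 15 12\nx 42 3\ndonn 9 1\n$ 80 2"
for token, spam, ham in single_char_tokens(dump):
    print(f"{token}: spam={spam} ham={ham} spamminess={spam/(spam+ham):.2f}")
```

Printing the spam fraction per token makes it easy to spot any single character that leans strongly hammish or spammish.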