Radical lexers
David Relson
relson at osagesoftware.com
Wed Dec 10 17:02:07 CET 2003
On Wed, 10 Dec 2003 16:35:40 +0100
Boris 'pi' Piwinger <3.14 at logic.univie.ac.at> wrote:
> David Relson wrote:
>
> > As you know, the FAQ exists in french. Several of the words it uses
> > are "données", "Problèmes", and "entraîner". With the [[:alnum:]]+
> > pattern, they parse as "donn", "es", "Probl", "mes", "entra", and
> > "ner". Since every token in the wordlist has a bunch of database
> > overhead associated with it, 6 short tokens may well use more space
> > than 3 long ones.
>
> That could explain size effects. Good.
>
> > A question: have your wordlists been compacted? Using "bogoutil -d
> > old.db | bogoutil -l new.db" can reduce the size by 40% or more.
>
> No, those lists were just built with -s and -n.
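The accented-word splitting quoted above can be sketched in Python (a rough emulation, assuming [[:alnum:]]+ matches only ASCII alphanumerics, as in the C locale; bogofilter's actual lexer is flex-generated, so this is illustrative only):

```python
import re

# Rough stand-in for [[:alnum:]]+ in the C locale, where accented
# bytes such as 'é' are not alphanumeric and therefore split words.
def tokenize(text):
    return re.findall(r'[A-Za-z0-9]+', text)

words = ["données", "Problèmes", "entraîner"]
fragments = [frag for w in words for frag in tokenize(w)]
print(fragments)  # ['donn', 'es', 'Probl', 'mes', 'entra', 'ner']
```

Each accented word becomes two short fragments, which is where the extra tokens (and their per-token database overhead) come from.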
Dump/load produces the smallest database. Since dump writes the tokens
in sorted order, the load process can simply write the database
sequentially. When building with -s and -n, tokens arrive in random
order, so data blocks get split, which wastes space. (c) has the
largest number of tokens, so probably has the largest number of split
blocks.
> > With regards to scoring, it's possible that having a larger number
> > of word fragments as tokens works as well as a smaller number of
> > complete words.
>
> Could be. Maybe we never noticed it since tokens of length
> one and two are not allowed.
At a rough count, there are fewer than 200 single character tokens (256
characters, minus 32 control characters and 25 or so special symbols).
Have you ever looked at their spam/ham counts? Are any of them
significantly hammish or spammish? Running 'bogoutil -d wordlist.db |
grep "^. "' would list them all.
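The same check can be sketched in Python. The dump format assumed here ("token spam_count ham_count" per line) is a guess for illustration; check your bogoutil version's actual output before relying on it:

```python
def single_char_tokens(dump_text):
    """Return (token, spam, ham) triples for one-character tokens
    found in a bogoutil-style dump (format assumed, see above)."""
    hits = []
    for line in dump_text.splitlines():
        token, spam, ham = line.split()[:3]
        if len(token) == 1:
            hits.append((token, int(spam), int(ham)))
    return hits

# Hypothetical dump fragment; real counts come from your wordlist.
dump = "the 120 340\na 15 12\nx 42 3\ndonn 9 1\n$ 80 2"
for token, spam, ham in single_char_tokens(dump):
    print(f"{token}: spam={spam} ham={ham} spamminess={spam/(spam+ham):.2f}")
```

Printing the spam fraction per token makes it easy to spot any single character that leans strongly hammish or spammish.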