Radical lexers

Boris 'pi' Piwinger 3.14 at logic.univie.ac.at
Wed Dec 10 17:07:42 CET 2003


David Relson wrote:

> Dump/load produces the smallest database.  Since dump writes the tokens
> in order, the load process can simply write the database.  When using -s
> and -n, data blocks get split which uses space.  (c) has the largest
> number of tokens, so probably has the largest number of split blocks.

I could look at this tomorrow.
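
(For reference, a dump/load round trip for compacting the database
could look like this; the temporary file name is only an example:

  bogoutil -d wordlist.db | bogoutil -l wordlist.db.new
  mv wordlist.db.new wordlist.db

Since the dump comes out sorted, the load can write the pages
sequentially, which is why the result stays close to the minimal
size.)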

>> > With regards to scoring, it's possible that having a larger number
>> > of word fragments as tokens works as well as a smaller number of
>> > complete words. 
>> 
>> Could be. Maybe we never noticed it since tokens of length
>> one and two are not allowed.
> 
> At a rough count, there are less than 200 single character tokens (256
> characters, less 32 control characters, 25 or so special symbols).  Have
> you ever looked at their spam/ham counts?  Are any of them significantly
hammish or spammish?  Running 'bogoutil -d wordlist.db | grep "^. "'
> would list them all.

Yes, but not for this lexer. It turned out that there are
significant tokens of length one, but very few; there are more
for two-byte tokens. But this lexer will certainly produce
different results.

What makes it hard to see their value is that you cannot
simply read it off with bogoutil.
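
Something like the following gives a rough impression, assuming the
dump has the token in the first column and the spam and ham counts
in the next two (the column order depends on the bogoutil version,
so check a few lines first; the minimum count of 10 is arbitrary):

  bogoutil -d wordlist.db |
    awk 'length($1) <= 2 && $2 + $3 >= 10 {
           printf("%s  spam=%s  ham=%s  ratio=%.2f\n",
                  $1, $2, $3, $2 / ($2 + $3)) }'

The raw count ratio is only a crude proxy for the score bogofilter
actually computes, but it shows quickly whether any of the short
tokens lean strongly towards ham or spam.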

pi



