Radical lexers
Boris 'pi' Piwinger
3.14 at logic.univie.ac.at
Wed Dec 10 16:35:40 CET 2003
David Relson wrote:
> As you know, the FAQ exists in French. Several of the words it uses are
> "données", "Problèmes", and "entraîner". With the [[:alnum:]]+ pattern,
> they parse as "donn", "es", "Probl", "mes", "entra", and "ner". Since
> every token in the wordlist has a bunch of database overhead associated
> with it, 6 short tokens may well use more space than 3 long ones.
That could explain size effects. Good.
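The splitting David describes can be reproduced with a short sketch. Note this is only a Python model of the behavior: bogofilter's real lexer is flex-generated C, and in the C locale the POSIX class [[:alnum:]] covers only ASCII letters and digits, so accented characters like é, è, and î break a word apart.

```python
import re

# ASCII-only equivalent of POSIX [[:alnum:]]+ in the C locale:
# accented letters fall outside [A-Za-z0-9] and split the word.
ASCII_ALNUM = re.compile(r"[A-Za-z0-9]+")

for word in ("données", "Problèmes", "entraîner"):
    print(word, "->", ASCII_ALNUM.findall(word))
# données   -> ['donn', 'es']
# Problèmes -> ['Probl', 'mes']
# entraîner -> ['entra', 'ner']
```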
> A question: have your wordlists been compacted? Using "bogoutil -d
> old.db | bogoutil -l new.db" can reduce the size by 40% or more.
No, those lists were just built with -s and -n.
> With regards to scoring, it's possible that having a larger number of
> word fragments as tokens works as well as a smaller number of complete
> words.
Could be. Maybe we never noticed it since tokens of length
one and two are not allowed.
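That minimum-length rule also interacts with the fragments above: some of the pieces an accented word breaks into are short enough to be discarded entirely. A small sketch (the length-3 cutoff follows from "tokens of length one and two are not allowed"; the token list is the French example from above):

```python
MIN_LEN = 3  # tokens of length one and two are not allowed

# fragments produced by [[:alnum:]]+ from "données", "Problèmes", "entraîner"
tokens = ["donn", "es", "Probl", "mes", "entra", "ner"]
kept = [t for t in tokens if len(t) >= MIN_LEN]
print(kept)  # ['donn', 'Probl', 'mes', 'entra', 'ner']
```

So "es" never reaches the wordlist at all, which is one way the fragmenting could go unnoticed.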
> You'd have to look in detail at the tokens in the messages.
> Unfortunately that's impractical with a large message count.
Absolutely. It would just be interesting to see other people
test those lexers.
BTW: I tried Tom's version. It just did not do
anything. I changed it to [^[:blank:][:cntrl:]\n]+ and it
still fails. I won't have the time to investigate. If
someone comes up with a better attempt, I can run it.
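For anyone who wants to experiment, here is a rough Python model of what that "radical" pattern is meant to do: take maximal runs of anything that is not a blank or a control character, so accented words stay whole (again only a sketch; the real lexer rule is a flex pattern, not Python):

```python
import re

# Approximation of [^[:blank:][:cntrl:]\n]+ :
# [:blank:] is space and tab, [:cntrl:] is \x00-\x1f and \x7f
# (\n is already covered by the control range).
RADICAL = re.compile(r"[^ \t\x00-\x1f\x7f]+")

print(RADICAL.findall("les données peuvent entraîner des Problèmes"))
# ['les', 'données', 'peuvent', 'entraîner', 'des', 'Problèmes']
```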
pi