Radical lexers
Boris 'pi' Piwinger
3.14 at logic.univie.ac.at
Wed Dec 10 16:35:40 CET 2003
David Relson wrote:
> As you know, the FAQ exists in French. Several of the words it uses are
> "données", "Problèmes", and "entraîner". With the [[:alnum:]]+ pattern,
> they parse as "donn", "es", "Probl", "mes", "entra", and "ner". Since
> every token in the wordlist has a bunch of database overhead associated
> with it, 6 short tokens may well use more space than 3 long ones.
That could explain size effects. Good.
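The splitting David describes can be reproduced with a short sketch. Note this is only a Python model of the behavior: bogofilter's real lexer is flex-generated C, and in the C locale the POSIX class [[:alnum:]] covers only ASCII letters and digits, so accented characters like é, è, and î break a word apart.

```python
import re

# ASCII-only equivalent of POSIX [[:alnum:]]+ in the C locale:
# accented letters fall outside [A-Za-z0-9] and split the word.
ASCII_ALNUM = re.compile(r"[A-Za-z0-9]+")

for word in ("données", "Problèmes", "entraîner"):
    print(word, "->", ASCII_ALNUM.findall(word))
# données   -> ['donn', 'es']
# Problèmes -> ['Probl', 'mes']
# entraîner -> ['entra', 'ner']
```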
> A question: have your wordlists been compacted? Using "bogoutil -d
> old.db | bogoutil -l new.db" can reduce the size by 40% or more.
No, those lists were just built with -s and -n.
> With regards to scoring, it's possible that having a larger number of
> word fragments as tokens works as well as a smaller number of complete
> words.
Could be. Maybe we never noticed it since tokens of length
one and two are not allowed.
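That minimum-length rule also interacts with the fragments above: some of the pieces an accented word breaks into are short enough to be discarded entirely. A small sketch (the length-3 cutoff follows from "tokens of length one and two are not allowed"; the token list is the French example from above):

```python
MIN_LEN = 3  # tokens of length one and two are not allowed

# fragments produced by [[:alnum:]]+ from "données", "Problèmes", "entraîner"
tokens = ["donn", "es", "Probl", "mes", "entra", "ner"]
kept = [t for t in tokens if len(t) >= MIN_LEN]
print(kept)  # ['donn', 'Probl', 'mes', 'entra', 'ner']
```

So "es" never reaches the wordlist at all, which is one way the fragmenting could go unnoticed.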
> You'd have to look in detail at the tokens in the messages.
> Unfortunately that's impractical with a large message count.
Absolutely. It would just be interesting to see other people
test those lexers.
BTW: I tried Tom's version. It just did not do
anything. I changed it to [^[:blank:][:cntrl:]\n]+ and it
still fails. I won't have the time to investigate. If
someone comes up with a better attempt, I can run it.
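For anyone who wants to experiment, here is a rough Python model of what that "radical" pattern is meant to do: take maximal runs of anything that is not a blank or a control character, so accented words stay whole (again only a sketch; the real lexer rule is a flex pattern, not Python):

```python
import re

# Approximation of [^[:blank:][:cntrl:]\n]+ :
# [:blank:] is space and tab, [:cntrl:] is \x00-\x1f and \x7f
# (\n is already covered by the control range).
RADICAL = re.compile(r"[^ \t\x00-\x1f\x7f]+")

print(RADICAL.findall("les données peuvent entraîner des Problèmes"))
# ['les', 'données', 'peuvent', 'entraîner', 'des', 'Problèmes']
```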
pi