Radical lexers
Boris 'pi' Piwinger
3.14 at logic.univie.ac.at
Tue Jan 20 17:47:20 CET 2004
David Relson wrote:
> Out of curiosity, I built bogolexers with the standard lexer_v3.l and
> yours
Just to be sure: did you use
http://piology.org/bogofilter/lexer_v3.l or
http://piology.org/bogofilter/lexer_v3.l.radical? (I use the
latter now.)
> File bf.tmp, generated by the standard lexer, has 5005 lines (tokens) in
> it. File pi.tmp, generated by your lexer, has 5918 lines. That's an
> increase of almost 1/5. Many of the differences are tokens like
> 0.408692, 0.410978, 0.412559, 0.412734, 0.413214, 0.416318, 0.418804,
> etc. which seem unlikely to recur.
As a first guess I would assume there are also a lot of 1-
and 2-byte tokens. Actually, numbers were useful in my tests,
IIRC; at least I could not refute that.
But looking at your example, it seems you took the former
lexer of mine. The latter does not allow a period in a token.
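A quick sketch of the difference, assuming simplified character
classes (these are illustrative, not the actual flex rules from
lexer_v3.l): once the period is excluded from the token class,
decimal numbers like 0.408692 no longer survive as single tokens.

```python
import re

# Hypothetical token classes, for illustration only:
# the former lexer admitted '.' inside a token, the latter does not.
TOKEN_WITH_DOT = re.compile(r"[A-Za-z0-9.]+")
TOKEN_NO_DOT = re.compile(r"[A-Za-z0-9]+")

text = "score 0.408692"
print(TOKEN_WITH_DOT.findall(text))  # ['score', '0.408692']
print(TOKEN_NO_DOT.findall(text))    # ['score', '0', '408692']
```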
> I'd warrant that your wordlists have a lot of hapaxes (tokens that have
> occurred once and only once) taking up space.
I cannot really answer that, since with my style of training
hapaxes are not as unusual as they would be with full training.
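For illustration, hapaxes in a wordlist can be counted with a
short script (hypothetical token stream, one token per entry;
this is not bogofilter's own storage format):

```python
from collections import Counter

# Toy token stream; hapaxes are the tokens seen exactly once.
tokens = ["free", "viagra", "free", "0.408692", "meeting", "free"]
counts = Counter(tokens)
hapaxes = [t for t, n in counts.items() if n == 1]
print(sorted(hapaxes))  # ['0.408692', 'meeting', 'viagra']
```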
> This seems contrary to
> your efforts to minimize wordlist size :-(
Changing the lexer was not really about minimizing the
wordlists; it was more for the pure sake of simplicity.
Whether this is good or bad is not clear in itself; I can
just say that it works for me.
pi