Radical lexers
Boris 'pi' Piwinger
3.14 at logic.univie.ac.at
Tue Jan 20 17:47:20 CET 2004
David Relson wrote:
> Out of curiosity, I built bogolexers with the standard lexer_v3.l and
> yours
Just to be sure: did you use
http://piology.org/bogofilter/lexer_v3.l or
http://piology.org/bogofilter/lexer_v3.l.radical? (I use the
latter now.)
> File bf.tmp, generated by the standard lexer, has 5005 lines (tokens) in
> it. File pi.tmp, generated by your lexer, has 5918 lines. That's an
> increase of almost 1/5. Many of the differences are tokens like
> 0.408692, 0.410978, 0.412559, 0.412734, 0.413214, 0.416318, 0.418804,
> etc. which seem unlikely to recur.
As a first guess I would assume there are also a lot of 1-
and 2-byte tokens. Actually, numbers were useful in my tests,
IIRC; at least I could not refute that.
But looking at your example, it seems you took the former
lexer of mine. The latter does not allow a period in a token.
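A quick sketch of the difference, assuming simplified character
classes (these are illustrative, not the actual flex rules from
lexer_v3.l): once the period is excluded from the token class,
decimal numbers like 0.408692 no longer survive as single tokens.

```python
import re

# Hypothetical token classes, for illustration only:
# the former lexer admitted '.' inside a token, the latter does not.
TOKEN_WITH_DOT = re.compile(r"[A-Za-z0-9.]+")
TOKEN_NO_DOT = re.compile(r"[A-Za-z0-9]+")

text = "score 0.408692"
print(TOKEN_WITH_DOT.findall(text))  # ['score', '0.408692']
print(TOKEN_NO_DOT.findall(text))    # ['score', '0', '408692']
```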
> I'd warrant that your wordlists have a lot of hapaxes (tokens that have
> occurred once and only once) taking up space.
I cannot really answer that, since with my style of training
hapaxes are not as unusual as they would be with full training.
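For illustration, hapaxes in a wordlist can be counted with a
short script (hypothetical token stream, one token per entry;
this is not bogofilter's own storage format):

```python
from collections import Counter

# Toy token stream; hapaxes are the tokens seen exactly once.
tokens = ["free", "viagra", "free", "0.408692", "meeting", "free"]
counts = Counter(tokens)
hapaxes = [t for t, n in counts.items() if n == 1]
print(sorted(hapaxes))  # ['0.408692', 'meeting', 'viagra']
```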
> This seems contrary to
> your efforts to minimize wordlist size :-(
Changing the lexer was not really about minimizing the
wordlists; it was more for the pure sake of simplicity.
Whether this is good or bad is not clear in itself; I can
just say that it works for me.
pi