Radical lexers
David Relson
relson at osagesoftware.com
Tue Jan 20 17:55:27 CET 2004
On Tue, 20 Jan 2004 17:47:20 +0100
Boris 'pi' Piwinger <3.14 at logic.univie.ac.at> wrote:
> David Relson wrote:
>
> > Out of curiosity, I built bogolexers with the standard lexer_v3.l
> > and yours
>
> Just to be sure. Did you use
> http://piology.org/bogofilter/lexer_v3.l or
> http://piology.org/bogofilter/lexer_v3.l.radical (I use the
> latter now).
The former, rather than the latter.
> > File bf.tmp, generated by the standard lexer, has 5005 lines
> > (tokens) in it. File pi.tmp, generated by your lexer, has 5918
> > lines. That's an increase of almost 1/5. Many of the differences
> > are tokens like 0.408692, 0.410978, 0.412559, 0.412734, 0.413214,
> > 0.416318, 0.418804, etc. which seem unlikely to recur.
>
> As a first shot I would assume there are also a lot of 1-
> and 2-byte tokens. Actually, numbers were useful in my tests,
> IIRC. At least I could not refute that.
>
> But looking at your example it seems you took the former
> lexer of mine. The latter does not allow period in a token.
Disallowing periods would change the tokens to 408692, 410978, 412559,
412734, 413214, 416318, 418804, etc. -- not a significant difference,
AFAICT.
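To make the difference concrete, here is a rough sketch in Python of the two tokenization behaviors being discussed. This is only an illustration, not bogofilter's actual flex rules: the regular expressions and the 3-character minimum token length are assumptions for the example.

```python
import re

# Illustrative sketch only -- NOT bogofilter's actual lexer rules.
# Two hypothetical token patterns: one allowing an embedded period
# (like the earlier lexer discussed above), one disallowing it.
WITH_PERIOD = re.compile(r"[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)*")
NO_PERIOD = re.compile(r"[A-Za-z0-9]+")

MIN_LEN = 3  # assumed minimum token length; shorter fragments dropped


def tokenize(text, pattern):
    """Return tokens matching `pattern`, discarding very short ones."""
    return [t for t in pattern.findall(text) if len(t) >= MIN_LEN]


sample = "scores 0.408692 0.410978"
print(tokenize(sample, WITH_PERIOD))  # ['scores', '0.408692', '0.410978']
print(tokenize(sample, NO_PERIOD))    # ['scores', '408692', '410978']
```

Under these assumptions, disallowing the period splits "0.408692" at the dot; the lone "0" falls below the length cutoff, leaving just "408692" -- matching the token lists quoted above.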
...[snip]...
> > This seems contrary to
> > your efforts to minimize wordlist size :-(
>
> Changing the lexer was not really about minimizing wordlists. It
> was more for the pure sake of simplicity. Whether that is good or
> bad is not clear in itself. I can just say it works for me.
>
> pi
It does seem that almost any reasonable lexer produces good results. A
year ago, bogofilter was doing fine without knowledge of MIME, HTML,
etc. It seems that lexer details aren't all that important.
More information about the Bogofilter mailing list