Radical lexers

David Relson relson at osagesoftware.com
Tue Jan 20 17:03:19 CET 2004


On Tue, 20 Jan 2004 15:40:46 +0100
Boris 'pi' Piwinger <3.14 at logic.univie.ac.at> wrote:

> Boris 'pi' Piwinger wrote:
> 
> > This is a very short test only. I compare my version (a) of
> > the lexer (http://piology.org/bogofilter/lexer_v3.l) with a
> > much stricter version of it (b). TOKEN will effectively be
> > of the form
> > [^[:blank:][:cntrl:]<>;&%@|/\\{}^"*,[\]=()+?:#$._!'`~-]+
> 
> That was more than a month ago. When 0.16 came out, I put
> this lexer (see http://piology.org/bogofilter/ for the
> radical lexer) into production. I have to say that I am
> totally satisfied.
> 
> So the main difference is that TOKEN is much simpler. In
> effect, tokens will on average be shorter, since they are
> split here where they are not with the standard lexer
> (my-lexer is one token in the latter, but two in the
> former).
> 
> Another side effect is that some rules become simpler by the
> shorter TOKEN definition, but that should not change the
> parsing.
> 
> Also, some special rules are simply dropped (the $-rule and
> the DOCTYPE switch).
> 
> pi

Hi pi,

Out of curiosity, I built bogolexers with the standard lexer_v3.l and
yours and tested the effects by parsing the two versions of lexer_v3.l
and the two mailboxes used in bogofilter's regression test.  Here's the
script I used:

  #!/bin/sh

  FILES="lexer_v3.l.bf lexer_v3.l.pi tests/inputs/good.mbx tests/inputs/spam.mbx"

  for NAME in bf pi ; do
      echo $NAME
      cp -f lexer_v3.l.$NAME lexer_v3.l
      for P in bogofilter bogolexer ; do
          make -s $P
          mv -f $P $P-$NAME
      done
      cat $FILES | bogolexer-$NAME -H -p | sort -u > $NAME.tmp
      wc -l < $NAME.tmp
  done

  gtkdiff bf.tmp pi.tmp


File bf.tmp, generated by the standard lexer, contains 5005 lines
(tokens).  File pi.tmp, generated by your lexer, contains 5918 lines.
That's an increase of almost 1/5.  Many of the differences are tokens
like 0.408692, 0.410978, 0.412559, 0.412734, 0.413214, 0.416318,
0.418804, etc., which seem unlikely to recur.
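The splitting difference is easy to see outside of flex.  Here's a
minimal Python sketch, not the actual lexer rules: RADICAL approximates
your TOKEN class, and STANDARD is a hypothetical looser class that only
differs in keeping '-', '.' and '_' inside tokens.

```python
import re

# Approximation (assumed) of the "radical" TOKEN class from lexer_v3.l:
# one or more characters that are not blanks, control characters, or
# any of the listed punctuation (including '-', '.' and '_').
RADICAL = re.compile(r"[^ \t\x00-\x1f<>;&%@|/\\{}^\"*,\[\]=()+?:#$._!'`~-]+")

# Hypothetical stand-in for the standard lexer's behavior: the same
# class, but with '-', '.' and '_' allowed inside a token.  The real
# flex rules are more involved; this is for illustration only.
STANDARD = re.compile(r"[^ \t\x00-\x1f<>;&%@|/\\{}^\"*,\[\]=()+?:#$!'`~]+")

text = "my-lexer scores 0.408692"
print(STANDARD.findall(text))  # ['my-lexer', 'scores', '0.408692']
print(RADICAL.findall(text))   # ['my', 'lexer', 'scores', '0', '408692']
```

Note how the radical class turns each decimal number into two short
digit runs, which is where many of the extra tokens come from.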

I'd warrant that your wordlists have a lot of hapaxes (tokens that have
occurred once and only once) taking up space.  This seems contrary to
your efforts to minimize wordlist size :-(
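Counting hapaxes in a token stream is straightforward.  Here's a small
Python sketch (not bogofilter code), assuming a list of tokens such as
the one-token-per-line output of `bogolexer -p` split into lines:

```python
from collections import Counter

def hapax_count(tokens):
    """Return the number of distinct tokens that occur exactly once."""
    counts = Counter(tokens)
    return sum(1 for n in counts.values() if n == 1)

# Tokens like the decimal fragments above tend to appear only once,
# so each one adds a hapax to the wordlist.
tokens = ["free", "free", "0.408692", "0.410978", "viagra"]
print(hapax_count(tokens))  # 3
```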

David




More information about the Bogofilter mailing list