lexer size [was: Much simplified lexer]

David Relson relson at osagesoftware.com
Thu Nov 13 16:15:12 CET 2003


On Thu, 13 Nov 2003 15:47:37 +0100
Boris 'pi' Piwinger <3.14 at logic.univie.ac.at> wrote:

> David Relson wrote:
...[snip]...
> 
> Right, looks great, also the unified use of character
> classes and a lot of those \ is suggested. But none of those
> make a difference.

I've been thinking about backslashes.  I'm not planning to change their
present usage, but if I were to make a change I'd do the following:

List the characters that can be special, i.e. operators.  This would
include *?+-[] (and others).  When a pattern uses one of them
_as_a_character_, I'd put in a backslash.  This would make it clear when
the rule wants a character and when a rule wants an operator.  The
uniform usage should make it much easier to understand the rules. 
Unfortunately this would probably add a lot of backslashes.
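
To illustrate the convention, here is a rough sketch in Python rather than
flex.  Python's regex syntax differs from flex in details, and both the
OPERATORS list and the escape_literals helper are made up for this example;
neither is part of bogofilter.

```python
import re

# Illustrative list of operator characters, taken from the examples above.
# This is NOT flex's complete set of special characters.
OPERATORS = set("*?+-[]")

def escape_literals(text):
    """Backslash-escape every operator character so it matches itself."""
    return "".join("\\" + ch if ch in OPERATORS else ch for ch in text)

pattern = escape_literals("a+b")                    # backslash added before '+'

plus_as_operator = re.fullmatch("a+b", "aab")       # '+' repeats the 'a'
plus_as_character = re.fullmatch(pattern, "a+b")    # escaped '+' is a literal plus
unescaped_on_literal = re.fullmatch("a+b", "a+b")   # no match: '+' is an operator
```

With the backslash, the reader can tell at a glance that the rule wants a
literal plus sign; without it, the same three characters mean "one or more
a, then b".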


> Here is the change which does make the difference. You
> changed this line:
> <INITIAL>(ESMTP|SMTP)+/[ \t\n]+id\ {ID}  
> to that line:
> <INITIAL>(ESMTP|SMTP)+

Right.  This change:

- <INITIAL>(ESMTP|SMTP)+/[ \t\n]+id\ {ID}  
+ <INITIAL>(ESMTP|SMTP)+

has the following size effects:

     100827 Nov 13 10:08 lexer_v3.c
     104516 Nov 13 10:11 lexer_v3.c

   text    data     bss     dec     hex filename
  41547       8      60   41615    a28f lexer_v3.o
  43405       8   65640  109053   1a9fd lexer_v3.o

> I don't understand what this is good for. In the original
> expression the / seems to be wrong, maybe the space behind
> "id" should also be any kind of whitespace. But why
> completely remove it?

Actually, Matthias made this change.  The old pattern allows a multiline
ESMTP|SMTP line; the new one does not.  In his test, the simpler
pattern works fine.

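Also, the "/" is the lexer's trailing-context operator.  A rule of the
form r/s matches r only when it is followed by s, but only the text
matching r (the part before the slash) is returned as the token.  The
text matching s stays in the input and is scanned again (later).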
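
As a rough analogy only, trailing context behaves much like a lookahead
in Python's re module.  The sketch below is not flex code, and the ID
definition in it is a hypothetical stand-in, since the lexer's real {ID}
is not shown in this thread.

```python
import re

# Hypothetical stand-in for the lexer's {ID} definition (an assumption).
ID = r"[A-Za-z0-9]+"

# The lookahead (?=...) plays the role of the "/" trailing context.
with_context = re.compile(r"(?:ESMTP|SMTP)+(?=[ \t\n]+id " + ID + ")")
# The simplified rule has no trailing context at all.
without_context = re.compile(r"(?:ESMTP|SMTP)+")

line = "ESMTP id PAA16337"
m = with_context.match(line)
token = m.group()        # only the text before the "slash" is returned
rest = line[m.end():]    # the context stays in the input to be rescanned

# The old pattern also accepts a line break between ESMTP and "id".
multiline = with_context.match("ESMTP\n\tid PAA16337")

# Without a following "id", only the simplified pattern matches.
no_id_old = with_context.match("ESMTP foo")
no_id_new = without_context.match("ESMTP foo")
```

Here token is "ESMTP" and rest still begins with " id", which is what
"reparsed later" means in practice.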

> Anyhow, wouldn't the following be nicer:
> <INITIAL>(E?SMTP)+

Looks reasonable.
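
A quick sanity check that the two spellings accept the same text, again
sketched with Python's re module instead of flex.  The sample strings
are made up for the example.

```python
import re

alternation = re.compile(r"(?:ESMTP|SMTP)+")   # current spelling
optional_e = re.compile(r"(?:E?SMTP)+")        # suggested spelling

def matched_prefix(regex, text):
    """Return the text matched at the start of the string, or None."""
    m = regex.match(text)
    return m.group() if m else None

# Made-up sample inputs, including some that should not match.
samples = ["SMTP", "ESMTP", "ESMTPSMTP", "SMTPX", "EESMTP", "smtp", ""]

# True when both patterns match the same prefix on every sample.
same = all(
    matched_prefix(alternation, s) == matched_prefix(optional_e, s)
    for s in samples
)
```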

> And why the +? I only see it in the form "with ESMTP id
> PAA16337" etc., no repeated SMTP or ESMTP. So I would have
> assumed that version:
> <INITIAL>E?SMTP{WHITESPACE}+{WHITESPACE}id{ID}

Appears to be unnecessary.  Without it, "make check" still succeeds so
the effect of the change is minor.



