lexer size [was: Much simplified lexer]
    David Relson 
    relson at osagesoftware.com
       
    Thu Nov 13 16:15:12 CET 2003
    
    
  
On Thu, 13 Nov 2003 15:47:37 +0100
Boris 'pi' Piwinger <3.14 at logic.univie.ac.at> wrote:
> David Relson wrote:
...[snip]...
> 
> Right, looks great, also the unified use of character
> classes and a lot of those \ is suggested. But none of those
> make a difference.
I've been thinking about backslashes.  I'm not planning to change their
present usage, but if I were to make a change I'd do the following:
List the characters that can be special, i.e. operators.  This would
include *?+-[] (and others).  When a pattern uses one of them
_as_a_character_, I'd put in a backslash.  This would make it clear when
the rule wants a character and when a rule wants an operator.  The
uniform usage should make it much easier to understand the rules. 
Unfortunately this would probably add a lot of backslashes.
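As a rough illustration of that convention, here is a minimal Python sketch.  The helper name escape_literal and the exact operator list are assumptions made for the example (only *?+-[] are named above); it is not code from bogofilter.

```python
# Hypothetical helper illustrating the "always escape operators" convention.
# The operator list is an assumption: the characters named in the message
# plus some other characters that commonly act as lex operators.
LEX_OPERATORS = set('*?+-[](){}|/\\.^$"<>')

def escape_literal(text):
    """Backslash-escape every operator character so it reads as a literal."""
    return "".join("\\" + ch if ch in LEX_OPERATORS else ch for ch in text)

print(escape_literal("a+b"))    # a\+b
print(escape_literal("[id]"))   # \[id\]
print(escape_literal("id"))     # id
```

With such a rule, any unescaped operator character in a pattern is an operator, and any escaped one is a plain character.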
> Here is the change which does make the difference. You
> changed this line:
> <INITIAL>(ESMTP|SMTP)+/[ \t\n]+id\ {ID}  
> to that line:
> <INITIAL>(ESMTP|SMTP)+
Right.  This change:
- <INITIAL>(ESMTP|SMTP)+/[ \t\n]+id\ {ID}  
+ <INITIAL>(ESMTP|SMTP)+
has the following size effects:
     100827 Nov 13 10:08 lexer_v3.c
     104516 Nov 13 10:11 lexer_v3.c

   text    data     bss     dec     hex filename
  41547       8      60   41615    a28f lexer_v3.o
  43405       8   65640  109053   1a9fd lexer_v3.o
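To make the differences explicit, the following Python sketch subtracts the first row from the second for each figure.  It assumes the two lexer_v3.o lines are output of the "size" utility, whose columns would be text, data, bss, dec and hex; the column names are not given in the listing itself.

```python
# Figures copied from the listing above as (first row, second row).
# Column names for the .o lines are an assumption (standard `size` layout).
rows = {
    "lexer_v3.c bytes": (100827, 104516),
    "text": (41547, 43405),
    "data": (8, 8),
    "bss": (60, 65640),
    "dec": (41615, 109053),
}

# Difference between the two builds for each figure.
deltas = {name: second - first for name, (first, second) in rows.items()}

for name, delta in deltas.items():
    print(f"{name:>17}: {delta:+d}")
```

Of the 67438-byte difference in the object file total, 65580 bytes are in bss and 1858 in text.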
> I don't understand what this is good for. In the original
> expression the / seems to be wrong, maybe the space behind
> "id" should also be any kind of whitespace. But why
> completely remove it?
Actually Matthias made this change.  The old pattern allows a multiline
ESMTP|SMTP line; the new pattern does not.  In his test, the simpler
pattern works fine.
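Also, the "/" is the lexer's trailing-context operator.  The rule matches
only when the text after the slash follows, but only the text before the
slash is returned as the token.  The text after the slash stays in the
input and is scanned again (later).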
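The behaviour of "/" can be approximated with a regex lookahead.  The Python sketch below is only an analogue of the old rule, not the lexer itself: ID_PATTERN is a stand-in, since the real {ID} definition is not shown in this thread, and lex's longest-match handling of trailing context can differ from a lookahead in corner cases.

```python
import re

# Stand-in for the lexer's {ID} definition (an assumption for this example).
ID_PATTERN = r"[A-Za-z0-9]+"

# Analogue of: (ESMTP|SMTP)+/[ \t\n]+id\ {ID}
# The lookahead (?=...) plays the role of the text after the slash:
# it must be present, but it is not part of the match.
RULE = re.compile(r"(?:ESMTP|SMTP)+(?=[ \t\n]+id " + ID_PATTERN + ")")

text = "ESMTP id PAA16337"
m = RULE.match(text)
token = m.group(0)       # what the rule returns
rest = text[m.end():]    # left in the input, to be scanned again
print(repr(token), repr(rest))

# The context may span a line break, which is the multiline case.
multiline = RULE.match("ESMTP\n  id PAA16337")
# Without the trailing context the rule does not match at all.
no_context = RULE.match("ESMTP server")
```

Here token is "ESMTP" and rest is " id PAA16337"; the second call matches across the newline and the third returns None.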
> Anyhow, wouldn't the following be nicer:
> <INITIAL>(E?SMTP)+
Looks reasonable.
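A quick way to check that the two spellings accept the same text is to compare them on a few samples.  This is a Python regex sketch with hand-picked sample strings, not a test from the bogofilter suite.

```python
import re

old = re.compile(r"(?:ESMTP|SMTP)+")   # current spelling
new = re.compile(r"(?:E?SMTP)+")       # suggested spelling

samples = ["SMTP", "ESMTP", "SMTPESMTP", "ESMTPSMTP",
           "ESMT", "EESMTP", "with ESMTP"]

for s in samples:
    a, b = old.match(s), new.match(s)
    # Both must either fail or match exactly the same prefix.
    assert (a and a.group(0)) == (b and b.group(0)), s

print("both patterns agree on all samples")
```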
> And why the +? I only see it in the form "with ESMTP id
> PAA16337" etc., no repeated SMTP or ESMTP. So I would have
> assumed that version:
> <INITIAL>E?SMTP{WHITESPACE}+{WHITESPACE}id{ID}
The "+" appears to be unnecessary.  Without it, "make check" still
succeeds, so the effect of the change is minor.
    
    