lexer size [was: Much simplified lexer]
David Relson
relson at osagesoftware.com
Thu Nov 13 16:15:12 CET 2003
On Thu, 13 Nov 2003 15:47:37 +0100
Boris 'pi' Piwinger <3.14 at logic.univie.ac.at> wrote:
> David Relson wrote:
...[snip]...
>
> Right, looks great, also the unified use of character
> classes and a lot of those \ is suggested. But none of those
> make a difference.
I've been thinking about backslashes. I'm not planning to change their
present usage, but if I were to make a change, I'd do the following:
List the characters that can be special, i.e. operators. This would
include *?+-[] (and others). When a pattern uses one of them
_as_a_character_, I'd put in a backslash. This would make it clear when
the rule wants a character and when a rule wants an operator. The
uniform usage should make it much easier to understand the rules.
Unfortunately this would probably add a lot of backslashes.
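The convention described above could be sketched roughly as follows. This is only an illustration in Python, not part of bogofilter: the helper name `escape_literal` is hypothetical, and the operator set is an assumption that is probably not exhaustive for flex.

```python
# Hypothetical helper: put a backslash before every character that flex
# could treat as an operator, so a literal use is visibly a literal.
# The operator set is an assumption and is not meant to be complete.
FLEX_OPERATORS = set('*?+-[](){}|/.^$"\\<>')

def escape_literal(text: str) -> str:
    """Return text with a backslash before every operator character."""
    return "".join("\\" + ch if ch in FLEX_OPERATORS else ch for ch in text)

print(escape_literal("a+b"))    # a\+b
print(escape_literal("[x-y]"))  # \[x\-y\]
```

Plain letters and digits pass through unchanged, so only rules that really use an operator character as text would gain backslashes.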
> Here is the change which does make the difference. You
> changed this line:
> <INITIAL>(ESMTP|SMTP)+/[ \t\n]+id\ {ID}
> to that line:
> <INITIAL>(ESMTP|SMTP)+
Right. This change:
- <INITIAL>(ESMTP|SMTP)+/[ \t\n]+id\ {ID}
+ <INITIAL>(ESMTP|SMTP)+
has the following size effects:
lexer_v3.c (bytes, timestamp):
100827 Nov 13 10:08 lexer_v3.c
104516 Nov 13 10:11 lexer_v3.c
lexer_v3.o (size output):
 text data   bss    dec   hex filename
41547    8    60  41615  a28f lexer_v3.o
43405    8 65640 109053 1a9fd lexer_v3.o
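As a quick sanity check of the object-file figures, the sketch below assumes the five numeric columns are the usual `size` fields (text, data, bss, decimal total, hex total) and computes where the difference lies. The numbers are copied from the two lines above; nothing else is measured.

```python
# Figures copied from the two lexer_v3.o lines: (text, data, bss, dec, hex).
rows = [
    (41547, 8, 60, 41615, "a28f"),
    (43405, 8, 65640, 109053, "1a9fd"),
]

# Each total should equal text + data + bss, in decimal and in hex.
for text, data, bss, dec, hx in rows:
    assert text + data + bss == dec == int(hx, 16)

small, large = rows
text_delta = large[0] - small[0]
bss_delta = large[2] - small[2]
total_delta = large[3] - small[3]

print(text_delta)   # 1858
print(bss_delta)    # 65580
print(total_delta)  # 67438
```

Under that assumption, nearly all of the difference between the two builds is in bss rather than in the code itself.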
> I don't understand what this is good for. In the original
> expression the / seems to be wrong, maybe the space behind
> "id" should also be any kind of whitespace. But why
> completely remove it?
Actually Matthias made this change. The old pattern allows a multiline
ESMTP|SMTP line; the new pattern does not. In his test, the simpler
pattern works fine.
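Also, the "/" is not wrong. It is the lexer's trailing-context operator:
the rule matches only when the text before the slash is followed by the
text after it, but only the part before the slash is returned as the
token. The text after the slash is not consumed and will be scanned
again (later).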
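To make the "/" behaviour concrete, here is a small sketch that uses Python's lookahead `(?=...)` as a stand-in for the lexer's trailing context. It is an analogy, not flex itself, and `ID` is a placeholder, since the real {ID} definition is not shown in this thread.

```python
import re

# Python's lookahead (?=...) stands in for the lexer's "/" operator.
# ID is a placeholder; the real {ID} definition is not shown here.
ID = r"[A-Za-z0-9]+"
pattern = re.compile(r"(?:ESMTP|SMTP)+(?=[ \t\n]+id " + ID + ")")

line = "ESMTP id PAA16337"
m = pattern.match(line)
token = m.group(0)     # only the text before the "/" is returned
rest = line[m.end():]  # the trailing context is left to be scanned again

print(repr(token))  # 'ESMTP'
print(repr(rest))   # ' id PAA16337'
```

With the trailing context in place, a line such as "ESMTP foo" does not match this rule at all, which is the restriction the simplified pattern gives up.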
> Anyhow, wouldn't the following be nicer:
> <INITIAL>(E?SMTP)+
Looks reasonable.
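A quick spot check that the two spellings accept the same strings, again using Python's `re` as a stand-in for the lexer. The sample list is arbitrary and this is a check on a few inputs, not a proof.

```python
import re

old = re.compile(r"(?:ESMTP|SMTP)+")
new = re.compile(r"(?:E?SMTP)+")

# Arbitrary samples, including some that should be rejected by both.
samples = ["SMTP", "ESMTP", "SMTPESMTP", "ESMTPSMTP", "EESMTP", "SMT", ""]

# For each sample, record whether the old and new patterns agree.
agreement = {s: bool(old.fullmatch(s)) == bool(new.fullmatch(s)) for s in samples}

print(all(agreement.values()))  # True
```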
> And why the +? I only see it in the form "with ESMTP id
> PAA16337" etc., no repeated SMTP or ESMTP. So I would have
> assumed that version:
> <INITIAL>E?SMTP{WHITESPACE}+{WHITESPACE}id{ID}
The + appears to be unnecessary. Without it, "make check" still
succeeds, so the effect of the change is minor.