Understanding lexer_v3.l changes

Boris 'pi' Piwinger 3.14 at piology.org
Sun Nov 26 18:28:49 CET 2006


David Relson <relson at osagesoftware.com> wrote:

>> I am just trying to understand the recent changes in lexer_v3.l:
>> 
>> :< /* $Id: lexer_v3.l,v 1.162 2005/06/27 00:40:48 relson Exp $ */
>> :> /* $Id: lexer_v3.l,v 1.167 2006/07/04 03:47:37 relson Exp $ */
>> 
>> So this is 1.0.3 vs 1.1.1
>> 
>> :< ID       <?[[:alnum:]-]*>?
>> :> ID       <?[[:alnum:]\-\.]*>?
>> 
>> What is the new dot good for? CVS has "Cleanup queue-id
>> processing." as a comment. I am not sure what it relates to,
>> but the long comment at the beginning of lexer_v3.l says
>> something about avoiding dots.
>
>It allows dots within IDs.

Obviously it does, but why? It worked without the dots for
ages.
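
To illustrate the difference, the two ID patterns can be modeled
with Python's re module (the POSIX class [[:alnum:]] is narrowed to
ASCII here, and the queue-id "<1A2B3C.4D>" is a made-up example):

```python
import re

old_id = re.compile(r"<?[A-Za-z0-9-]*>?")   # ID before: no dot allowed
new_id = re.compile(r"<?[A-Za-z0-9.-]*>?")  # ID after: dot allowed

print(old_id.fullmatch("<1A2B3C.4D>") is not None)  # False
print(new_id.fullmatch("<1A2B3C.4D>") is not None)  # True
```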

>> BTW, what was the reason, that TOKEN is not allowed to start
>> with one digit, but may contain digits inside?
>
>This makes "A123" a valid token while "1234" is not a valid
>token.  Allowing tokens that are totally numeric would be a
>bad thing, no?

Actually, I removed this restriction from my version of the
lexer long ago; it had no significant effect for me. While I
don't recall testing this feature in isolation, I could not
find any problem with the simplification in my tests, though
that was quite a while ago now:
http://piology.org/bogofilter/#tests
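
The rule under discussion can be sketched with a simplified stand-in
for TOKEN (character classes reduced to ASCII; the real lexer's
classes are richer):

```python
import re

# Sketch of the rule: a token may not *start* with a digit, but may
# contain digits afterwards.
token = re.compile(r"[A-Za-z][A-Za-z0-9]*")

print(token.fullmatch("A123") is not None)  # True  -- valid token
print(token.fullmatch("1234") is not None)  # False -- all-numeric rejected
```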

>> :< <HTOKEN>{TOKEN}                    { return TOKEN; }
>> :> <HTOKEN>({TOKEN}|{SHORT_TOKEN})    { return TOKEN; }
>> :< {TOKEN}                            { return TOKEN; }
>> :> ({TOKEN}|{SHORT_TOKEN})            { return TOKEN; }
>> 
>> Why not define TOKEN in the first place like this:
>> {TOKENFRONT}({TOKENMID}{TOKENBACK})? and TOKENMID with a *
>> instead of a + in the end?
>
>As best I can tell, your suggestions add up to "SHORT_TOKEN isn't
>necessary. A few changes to TOKEN can eliminate it."  Even if that's
>not exactly what you're thinking, I've eliminated SHORT_TOKEN without
>breaking "make check".

Actually, this is exactly what I thought would be possible.
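
The shape of that refactor can be sketched in Python with hypothetical
simplified classes for TOKENFRONT, TOKENMID, and TOKENBACK (the real
definitions differ):

```python
import re

front, mid, back = r"[A-Za-z]", r"[A-Za-z0-9.'-]*", r"[A-Za-z0-9]"

# Folding SHORT_TOKEN into TOKEN: the (...)? makes the tail optional
# (single-character tokens) and the * in mid permits two-character ones.
token = re.compile(front + "(" + mid + back + ")?")

for s in ("a", "ab", "a-b", "abc123", "-a"):
    print(s, token.fullmatch(s) is not None)
```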

In my version in addition I made TOKENFRONT, TOKENMID and
TOKENBACK the same. This *does* *change* the meaning, but
works perfectly for me while reducing complexity.

>With the suggested changes to TOKEN and TOKENMID, it seems that TOKEN
>works fine wherever TOKEN_12 is used, i.e. that T12 and TOKEN_12 can
>be eliminated.  Right?

I believe you are correct: a string matching TOKEN_12,
i.e. an alphabetic character followed by an alphanumeric
character, already matches TOKEN, so TOKEN_12 will never be
tested. Indeed, this is what I have:
http://piology.org/bogofilter/lexer_v3.l.radical
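
The subsumption argument can be checked exhaustively with simplified
ASCII stand-ins for the two patterns (hypothetical classes, not the
lexer's exact ones):

```python
import re
import string
from itertools import product

# TOKEN_12 ~ an alphabetic character followed by one alphanumeric;
# TOKEN ~ an alphabetic character followed by any alphanumerics.
token_12 = re.compile(r"[A-Za-z][A-Za-z0-9]")
token    = re.compile(r"[A-Za-z][A-Za-z0-9]*")

# Every string matched by TOKEN_12 is also matched by TOKEN, so the
# TOKEN_12 rule is never reached and can be dropped.
alnum = string.ascii_letters + string.digits
assert all(token.fullmatch(a + b)
           for a, b in product(string.ascii_letters, alnum))
print("TOKEN subsumes TOKEN_12")
```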

>> :< \${NUM}(\.{NUM})?    { return TOKEN; }    /* Dollars and cents */
>> :> \${NUM}(\.{NUM})?    { return MONEY; }    /* Dollars and cents */
>> 
>> What is the new return code good for? But anyhow, for me
>> those would be normal tokens;-)
>
>File token.c had some special processing to allow 2 character money
>tokens, i.e. "$1", "$2", etc.  The MONEY code allows a cleaner
>implementation of this special case.

I see. I had removed this clause and don't allow $ in TOKEN
at all. Maybe it should be retested whether the additional
complexity of currency handling actually adds any benefit.
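
For reference, the MONEY rule from the diff behaves like this when
modeled in Python (assuming NUM is a run of digits, as is usual):

```python
import re

# \${NUM}(\.{NUM})? -- dollars, optionally followed by cents
money = re.compile(r"\$[0-9]+(\.[0-9]+)?")

for s in ("$1", "$19.99", "$", "19.99"):
    print(s, money.fullmatch(s) is not None)
# $1 and $19.99 match; a bare "$" or a number without "$" does not.
```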

>I've attached a patch file with the changes from 1.1.1 to current cvs
>for lexer_v3.l and token.c.  If you have further improvements (that
>don't break "make check"), I'm all ears.

Most of them should qualify. The others change the meaning,
which may or may not be beneficial.

You might want to try `diff -b -B` against my lexer for more
ideas.

What I do see (no idea when we went different directions) is
<HTOKEN>({TOKEN}) vs. <HTOKEN>{TOKEN}, and the same without
<HTOKEN>. Does this make any difference? If not, I would go
for the simpler version.
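
As far as I can tell, flex parenthesizes name expansions itself, so the
extra parentheses should only matter stylistically. In regex terms the
two forms accept exactly the same strings (using a hypothetical
stand-in for {TOKEN}):

```python
import re

pat = r"[A-Za-z][A-Za-z0-9]*"          # stand-in for {TOKEN}
plain   = re.compile(pat)              # {TOKEN}
grouped = re.compile("(" + pat + ")")  # ({TOKEN})

# The parentheses only add a capture group; the set of matched strings
# is identical, which argues for the simpler form.
for s in ("abc", "a1", "1a"):
    assert (plain.fullmatch(s) is None) == (grouped.fullmatch(s) is None)
print("equivalent")
```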

pi
