Understanding lexer_v3.l changes
Boris 'pi' Piwinger
3.14 at piology.org
Sun Nov 26 18:28:49 CET 2006
David Relson <relson at osagesoftware.com> wrote:
>> I'm just trying to understand the recent changes in lexer_v3.l:
>>
>> :< /* $Id: lexer_v3.l,v 1.162 2005/06/27 00:40:48 relson Exp $ */
>> :> /* $Id: lexer_v3.l,v 1.167 2006/07/04 03:47:37 relson Exp $ */
>>
>> So this is 1.0.3 vs 1.1.1
>>
>> :< ID <?[[:alnum:]-]*>?
>> :> ID <?[[:alnum:]\-\.]*>?
>>
>> What is the new dot good for? CVS has "Cleanup queue-id
>> processing." as the commit comment. I am not sure what it
>> relates to, but the long comment at the beginning of
>> lexer_v3.l says something about avoiding dots.
>
>It allows dots within IDs.
Obviously, it does, but why? It used to work without the
dots for ages.
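For illustration, the two ID patterns can be sketched as Python regexes (flex's [[:alnum:]] is approximated here with [0-9A-Za-z]; the queue-id is a made-up example, not taken from the lexer):

```python
import re

# Rough Python equivalents of the two flex ID definitions:
ID_OLD = re.compile(r'<?[0-9A-Za-z-]*>?')   # 1.162: no dot allowed
ID_NEW = re.compile(r'<?[0-9A-Za-z.-]*>?')  # 1.167: dot allowed

# A hypothetical MTA queue-id containing a dot:
qid = "<1A2B3C.4D5E6F>"
print(ID_OLD.fullmatch(qid) is not None)  # False: dot breaks the match
print(ID_NEW.fullmatch(qid) is not None)  # True: dot is now accepted
```

So the practical effect of the change is that queue-ids with embedded dots match as a single ID instead of being split.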
>> BTW, what was the reason, that TOKEN is not allowed to start
>> with one digit, but may contain digits inside?
>
>This makes "A123" a valid token while "1234" is not a valid
>token. Allowing tokens that are totally numeric would be a
>bad thing, no?
Actually, I had removed this from my version of the lexer
long ago. It did not have any significant effect for me.
While I don't recall whether I tested this feature in
isolation, I could not find any problem with the
simplification in my tests, though that was quite a while
ago now:
http://piology.org/bogofilter/#tests
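The restriction under discussion can be sketched like this (the character classes are assumptions for illustration, not the actual lexer_v3.l definitions):

```python
import re

# A TOKEN that must start with a letter but may contain digits afterwards:
TOKENFRONT = r'[A-Za-z]'      # assumed: letters only at the front
TOKENMID   = r'[A-Za-z0-9]*'  # assumed: digits allowed inside
TOKEN = re.compile(TOKENFRONT + TOKENMID)

print(TOKEN.fullmatch("A123") is not None)  # True: starts with a letter
print(TOKEN.fullmatch("1234") is not None)  # False: purely numeric
```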
>> :< <HTOKEN>{TOKEN}                 { return TOKEN; }
>> :> <HTOKEN>({TOKEN}|{SHORT_TOKEN}) { return TOKEN; }
>> :< {TOKEN}                         { return TOKEN;}
>> :> ({TOKEN}|{SHORT_TOKEN})         { return TOKEN;}
>>
>> Why not define TOKEN in the first place like this:
>> {TOKENFRONT}({TOKENMID}{TOKENBACK})? with TOKENMID ending
>> in a * instead of a +?
>
>As best I can tell, your suggestions add up to "SHORT_TOKEN isn't
>necessary. A few changes to TOKEN can eliminate it." Even if that's
>not exactly what you're thinking, I've eliminated SHORT_TOKEN without
>breaking "make check".
Actually, this is exactly what I thought would be possible.
In my version I additionally made TOKENFRONT, TOKENMID and
TOKENBACK the same. This *does* *change* the meaning, but
it works perfectly for me while reducing complexity.
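The suggested restructuring can be sketched in Python regex terms (the names follow the discussion, but the exact character classes are assumptions, not the real lexer_v3.l ones):

```python
import re

FRONT = r'[A-Za-z]'
MID   = r"[A-Za-z0-9'.-]*"  # note the suggested '*' instead of '+'
BACK  = r'[A-Za-z0-9]'

# TOKEN defined as {TOKENFRONT}({TOKENMID}{TOKENBACK})? matches both
# multi-character tokens and single-character (SHORT_TOKEN-style) ones:
TOKEN = re.compile(f"{FRONT}(?:{MID}{BACK})?")

print(TOKEN.fullmatch("w") is not None)      # True: one char, no SHORT_TOKEN rule needed
print(TOKEN.fullmatch("won't") is not None)  # True: multi-char token
```

This is why the separate SHORT_TOKEN alternative becomes redundant: the optional group already covers the one-character case.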
>With the suggested changes to TOKEN and TOKENMID, it seems that TOKEN
>works fine wherever TOKEN_12 is used, i.e. that T12 and TOKEN_12 can
>be eliminated. Right?
I believe you are correct, since a string matching TOKEN_12,
i.e. an alphabetic character followed by an alphanumeric
character, would already match TOKEN. In that case TOKEN_12
will never be tested. Indeed, this is what I have:
http://piology.org/bogofilter/lexer_v3.l.radical
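The subsumption argument can be checked mechanically (character classes again assumed for illustration):

```python
import re

# With TOKEN generalized as above, every string that TOKEN_12 would accept
# (a letter followed by one alphanumeric) is already accepted by TOKEN:
TOKEN    = re.compile(r"[A-Za-z](?:[A-Za-z0-9'.-]*[A-Za-z0-9])?")
TOKEN_12 = re.compile(r'[A-Za-z][A-Za-z0-9]')  # assumed two-char rule

for s in ("A1", "ab", "Z9"):
    assert TOKEN_12.fullmatch(s) and TOKEN.fullmatch(s)
print("every TOKEN_12 match is also a TOKEN match")
```

Since flex tries rules in order and TOKEN would already consume such input, the TOKEN_12 rule could never fire.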
>> :< \${NUM}(\.{NUM})? { return TOKEN;} /* Dollars and cents */
>> :> \${NUM}(\.{NUM})? { return MONEY;} /* Dollars and cents */
>>
>> What is the new return code good for? But anyhow, for me
>> those would be normal tokens;-)
>
>File token.c had some special processing to allow 2 character money
>tokens, i.e. "$1", "$2", etc. The MONEY code allows a cleaner
>implementation of this special case.
I see. I had removed this clause and don't allow $ in TOKEN
at all. Maybe it should be retested whether the additional
complexity of currency handling adds any benefit.
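The quoted flex rule \${NUM}(\.{NUM})? can be sketched in Python, assuming NUM is a run of digits (the real definition may differ):

```python
import re

NUM = r'[0-9]+'
MONEY = re.compile(rf'\${NUM}(?:\.{NUM})?')

print(MONEY.fullmatch("$1") is not None)      # True: the short 2-char case
print(MONEY.fullmatch("$19.99") is not None)  # True: dollars and cents
print(MONEY.fullmatch("$") is not None)       # False: digits are required
```

This shows why a dedicated MONEY code simplifies token.c: the lexer itself already accepts "$1"-style tokens, so no special-case length check is needed there.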
>I've attached a patch file with the changes from 1.1.1 to current cvs
>for lexer_v3.l and token.c. If you have further improvements (that
>don't break "make check"), I'm all ears.
Most of them should be. The others will change the meaning,
which may or may not be beneficial.
You might want to try `diff -b -B` against my lexer for more
ideas.
What I do see (no idea when our versions diverged) is
<HTOKEN>({TOKEN}) vs. <HTOKEN>{TOKEN}, and the same without
<HTOKEN>. Does this make any difference? If not, I would go
for the simpler version.
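In regex terms, wrapping a single alternative in a group does not change the language matched, which suggests the parentheses are redundant (a Python illustration, not the flex internals):

```python
import re

# (X) and X accept exactly the same strings when X is a single alternative:
X = r'[A-Za-z]+'
plain   = re.compile(X)
grouped = re.compile(f'(?:{X})')

for s in ("hello", "HTOKEN", "a", "123"):
    assert bool(plain.fullmatch(s)) == bool(grouped.fullmatch(s))
print("grouping makes no difference to the matched language")
```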
pi
More information about the Bogofilter mailing list