Understanding lexer_v3.l changes
David Relson
relson at osagesoftware.com
Sun Nov 26 17:49:05 CET 2006
On Sun, 26 Nov 2006 16:47:35 +0100 Boris 'pi' Piwinger wrote:
> Hi!
>
> I am just trying to understand the recent changes in lexer_v3.l:
>
> :< /* $Id: lexer_v3.l,v 1.162 2005/06/27 00:40:48 relson Exp $ */
> :> /* $Id: lexer_v3.l,v 1.167 2006/07/04 03:47:37 relson Exp $ */
>
> So this is 1.0.3 vs 1.1.1
>
> :< ID <?[[:alnum:]-]*>?
> :> ID <?[[:alnum:]\-\.]*>?
>
> What is the new dot good for? CVS has "Cleanup queue-id
> processing." as a comment. I am not sure what it relates to,
> but the long comment at the beginning of lexer_v3.l says
> something about avoiding dots.
It allows dots within IDs.
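
To illustrate the effect of the added dot, here is a minimal Python sketch. It is not bogofilter code: Python's re module has no POSIX character classes, so [[:alnum:]] is approximated by A-Za-z0-9, and the sample ID is made up for the example.

```python
import re

# Approximations of the two flex definitions; [[:alnum:]] is written as
# A-Za-z0-9 because Python's re module has no POSIX character classes.
ID_OLD = re.compile(r"<?[A-Za-z0-9-]*>?")     # 1.0.3: no dot in the class
ID_NEW = re.compile(r"<?[A-Za-z0-9\-.]*>?")   # 1.1.1: dot added to the class

def matches_whole(pattern, text):
    """Return True when the pattern covers the entire string."""
    return pattern.fullmatch(text) is not None

sample = "<1GoNab-0003xy.local>"   # hypothetical ID, for illustration only
print(matches_whole(ID_OLD, sample))
print(matches_whole(ID_NEW, sample))
```

With the old class the match stops at the first dot, so the ID is not consumed as a single unit; with the new class the whole string matches.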
> :> SHORT_TOKEN {TOKENFRONT}{TOKENBACK}?
> :> T1 [[:alpha:]]
> :< TOKEN_12 ({TOKEN}|{T12})
> :> TOKEN_12 ({TOKEN}|{T12}|{T1})
>
> We now have:
> T1 [[:alpha:]]
> T12 [[:alpha:]][[:alnum:]]?
> TOKEN_12 ({TOKEN}|{T12}|{T1})
>
> If I am not totally wrong, a string matching T1 will also
> match T12, so we could simply drop the new addition.
You are correct. Removing T1 does not affect "make check", so it'll
be removed from CVS shortly.
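
The redundancy can be checked mechanically. The sketch below is an illustration only, with the POSIX classes approximated by ASCII ranges; it confirms that every single printable character accepted by T1 is also accepted by T12.

```python
import re
import string

# ASCII approximations of the flex definitions quoted above.
T1 = re.compile(r"[A-Za-z]")
T12 = re.compile(r"[A-Za-z][A-Za-z0-9]?")

def t1_is_redundant(candidates=string.printable):
    """True when every candidate accepted by T1 is also accepted by T12."""
    return all(T12.fullmatch(c) is not None
               for c in candidates
               if T1.fullmatch(c) is not None)

print(t1_is_redundant())
```

T1 only ever matches a single character, so checking single characters is enough for this comparison.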
> BTW, what was the reason that TOKEN is not allowed to start
> with a digit, but may contain digits inside?
This makes "A123" a valid token while "1234" is not a valid
token. Allowing tokens that are totally numeric would be a
bad thing, no?
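
As a rough illustration of that rule, the following sketch uses a hypothetical stand-in for TOKEN. The real definition in lexer_v3.l is built from TOKENFRONT, TOKENMID and TOKENBACK and is more involved; only the "no leading digit" property is modeled here.

```python
import re

# Hypothetical stand-in for TOKEN: the first character must not be a digit,
# later characters may be. This is NOT the definition used in lexer_v3.l.
TOKEN_SKETCH = re.compile(r"[A-Za-z][A-Za-z0-9]*")

def is_token(text):
    """Return True when the whole string is accepted by the stand-in pattern."""
    return TOKEN_SKETCH.fullmatch(text) is not None

for word in ("A123", "1234", "1A23"):
    print(word, is_token(word))
```

Note that the rule rejects any string with a leading digit, such as "1A23", not just all-numeric strings.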
> :< old: ENCODED_WORD =\?{CHARSET}\?(b\?{BASE64}|q\?{QP})\?=
> :> old: ENCODED_WORD =\?{CHARSET}\?(b\?{BASE64}\|q\?{QP})\?=
> :< HTML_WO_COMMENTS "<"[^!][^>]*">"|"<>"
> :> HTML_WO_COMMENTS "<"[^!][^>]*">"\|"<>"
>
> Purely cosmetic.
>
> :< <HTOKEN>{TOKEN}                   { return TOKEN; }
> :> <HTOKEN>({TOKEN}|{SHORT_TOKEN})   { return TOKEN; }
> :< {TOKEN}                           { return TOKEN;}
> :> ({TOKEN}|{SHORT_TOKEN})           { return TOKEN;}
>
> Why not define TOKEN in the first place like this:
> {TOKENFRONT}({TOKENMID}{TOKENBACK})? and TOKENMID with a *
> instead of a + in the end?
As best I can tell, your suggestions add up to "SHORT_TOKEN isn't
necessary. A few changes to TOKEN can eliminate it." Even if that's
not exactly what you're thinking, I've eliminated SHORT_TOKEN without
breaking "make check".
With the suggested changes to TOKEN and TOKENMID, it seems that TOKEN
works fine wherever TOKEN_12 is used, i.e. that T12 and TOKEN_12 can
be eliminated. Right?
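
One way to sanity-check that question is a brute-force comparison over short strings. The sketch below is an assumption-laden illustration: the FRONT, MID and BACK character classes are simplified placeholders (not the ones in lexer_v3.l), and the old TOKEN is assumed to be TOKENFRONT TOKENMID TOKENBACK with TOKENMID ending in +, as the question implies.

```python
import re
from itertools import product

# Simplified placeholder classes, NOT the real lexer_v3.l definitions.
FRONT = r"[A-Za-z]"
MID = r"[A-Za-z0-9.\-]"
BACK = r"[A-Za-z0-9]"

TOKEN_OLD = re.compile(f"{FRONT}{MID}+{BACK}")        # assumed old shape
SHORT_TOKEN = re.compile(f"{FRONT}{BACK}?")           # as quoted above
T12 = re.compile(r"[A-Za-z][A-Za-z0-9]?")             # as quoted above
TOKEN_NEW = re.compile(f"{FRONT}(?:{MID}*{BACK})?")   # suggested shape

def new_token_covers_all(alphabet="aZ9.-$", max_len=3):
    """True when TOKEN_NEW accepts everything the three older patterns accept."""
    for n in range(1, max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            accepted_before = (TOKEN_OLD.fullmatch(s)
                               or SHORT_TOKEN.fullmatch(s)
                               or T12.fullmatch(s))
            if accepted_before and not TOKEN_NEW.fullmatch(s):
                return False
    return True

print(new_token_covers_all())
```

Under these simplified classes the suggested TOKEN accepts one- and two-character strings as well as longer ones. Whether the same holds for the real character classes depends on TOKENFRONT including every alphabetic character and TOKENBACK including every alphanumeric one, so "make check" remains the real test.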
> :< \${NUM}(\.{NUM})?   { return TOKEN;}   /* Dollars and cents */
> :> \${NUM}(\.{NUM})?   { return MONEY;}   /* Dollars and cents */
>
> What is the new return code good for? But anyhow, for me
> those would be normal tokens;-)
File token.c had some special processing to allow 2 character money
tokens, i.e. "$1", "$2", etc. The MONEY code allows a cleaner
implementation of this special case.
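
The idea can be sketched as follows. This is not the code in token.c: the minimum length, the function names, and the assumption that NUM is one or more digits are all hypothetical and chosen only to show why a separate return code simplifies the special case.

```python
import re

MIN_TOKEN_LEN = 3   # hypothetical minimum length, for illustration only

# NUM is assumed to be one or more digits; the lexer's definition may differ.
MONEY = re.compile(r"\$[0-9]+(?:\.[0-9]+)?")

def classify(text):
    """Mimic the lexer's return code: MONEY for dollar amounts, else TOKEN."""
    return "MONEY" if MONEY.fullmatch(text) else "TOKEN"

def keep(text):
    """Accept MONEY tokens of any length; apply the length check to the rest."""
    return classify(text) == "MONEY" or len(text) >= MIN_TOKEN_LEN

for word in ("$1", "$1.50", "ab", "abc"):
    print(word, classify(word), keep(word))
```

Because the lexer already tags the token as MONEY, the length check does not need to inspect the text for a leading dollar sign.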
I've attached a patch file with the changes from 1.1.1 to current cvs
for lexer_v3.l and token.c. If you have further improvements (that
don't break "make check"), I'm all ears.
Enjoy!
David
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: patch.1126.parsing.txt
URL: <http://www.bogofilter.org/pipermail/bogofilter/attachments/20061126/1744904c/attachment.txt>