Understanding lexer_v3.l changes

David Relson relson at osagesoftware.com
Sun Nov 26 17:49:05 CET 2006


On Sun, 26 Nov 2006 16:47:35 +0100 Boris 'pi' Piwinger wrote:

> Hi!
> 
> I am just trying to understand the recent changes in lexer_v3.l:
> 
> :< /* $Id: lexer_v3.l,v 1.162 2005/06/27 00:40:48 relson Exp $ */
> :> /* $Id: lexer_v3.l,v 1.167 2006/07/04 03:47:37 relson Exp $ */
> 
> So this is 1.0.3 vs 1.1.1
> 
> :< ID       <?[[:alnum:]-]*>?
> :> ID       <?[[:alnum:]\-\.]*>?
> 
> What is the new dot good for? CVS has "Cleanup queue-id
> processing." as a comment. I am not sure what it relates to,
> but the long comment at the beginning of lexer_v3.l says
> something about avoiding dots.

It allows dots within IDs.
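
A rough way to see the effect of the change is to try both ID
definitions on a few strings.  The sketch below uses Python's re module
as a stand-in for flex; narrowing [[:alnum:]] to ASCII and the sample
IDs themselves are assumptions made only for illustration.

```python
import re

# Python approximations of the two flex ID definitions quoted above.
# Assumption: [[:alnum:]] is narrowed to ASCII letters and digits.
OLD_ID = re.compile(r"<?[A-Za-z0-9-]*>?")      # 1.0.3: no dot allowed
NEW_ID = re.compile(r"<?[A-Za-z0-9\-.]*>?")    # 1.1.1: dot allowed


def is_id(pattern, text):
    """Return True if the whole string matches the given ID pattern."""
    return pattern.fullmatch(text) is not None


# Made-up sample IDs, not taken from real mail headers.
samples = ["1GoKZn-0001Ab-7x", "<20061126.abc-def>", "k8QFl5.12345"]

for s in samples:
    print(s, "old:", is_id(OLD_ID, s), "new:", is_id(NEW_ID, s))
```

Only the strings containing a dot should differ between the two
patterns; IDs without dots match both.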

> :> SHORT_TOKEN   {TOKENFRONT}{TOKENBACK}?
> :> T1       [[:alpha:]]
> :< TOKEN_12      ({TOKEN}|{T12})
> :> TOKEN_12      ({TOKEN}|{T12}|{T1})
> 
> We now have: 
> T1              [[:alpha:]]
> T12             [[:alpha:]][[:alnum:]]?
> TOKEN_12        ({TOKEN}|{T12}|{T1})
> 
> If I am not totally wrong, a string matching T1 will also
> match T12, so we could simply drop the new addition.

You are correct.  Removing T1 does not affect "make check", so it'll
be removed from CVS shortly.
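
The redundancy can also be checked mechanically.  This is a minimal
sketch, again with Python's re module standing in for flex and with
the POSIX classes narrowed to ASCII (an assumption): every
one-character string that T1 accepts is also accepted by T12, because
the second character of T12 is optional.

```python
import re
import string

# ASCII approximations of the flex definitions (assumption).
T1 = re.compile(r"[A-Za-z]")                # [[:alpha:]]
T12 = re.compile(r"[A-Za-z][A-Za-z0-9]?")   # [[:alpha:]][[:alnum:]]?

# T1 can only match single characters, so it is enough to test
# every printable ASCII character.
t1_strings = [c for c in string.printable if T1.fullmatch(c)]

# True if everything T1 accepts is accepted by T12 as well.
redundant = all(T12.fullmatch(s) is not None for s in t1_strings)

print(len(t1_strings), "strings match T1; all covered by T12:", redundant)
```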

> BTW, what was the reason that TOKEN is not allowed to start
> with a digit, but may contain digits inside?

This makes "A123" a valid token while "1234" is not a valid
token.  Allowing tokens that are totally numeric would be a
bad thing, no?
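
As an illustration of that rule only: the pattern below is a
hypothetical simplification (a letter followed by letters or digits),
not the TOKEN definition from lexer_v3.l, whose character classes are
broader.

```python
import re

# Hypothetical stand-in for TOKEN: the first character must be a letter,
# later characters may be letters or digits.  The real definition in
# lexer_v3.l uses different character classes.
SIMPLE_TOKEN = re.compile(r"[A-Za-z][A-Za-z0-9]*")


def is_token(text):
    """Return True if the whole string is a token under the simplified rule."""
    return SIMPLE_TOKEN.fullmatch(text) is not None


for word in ["A123", "1234"]:
    print(word, is_token(word))
```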

> :<   old: ENCODED_WORD =\?{CHARSET}\?(b\?{BASE64}|q\?{QP})\?=
> :>   old: ENCODED_WORD =\?{CHARSET}\?(b\?{BASE64}\|q\?{QP})\?=
> :< HTML_WO_COMMENTS      "<"[^!][^>]*">"|"<>"
> :> HTML_WO_COMMENTS      "<"[^!][^>]*">"\|"<>"
> 
> Purely cosmetic.
> 
> :< <HTOKEN>{TOKEN}                               { return TOKEN; }
> :> <HTOKEN>({TOKEN}|{SHORT_TOKEN})               { return TOKEN; }
> :< {TOKEN}                                       { return TOKEN;}
> :> ({TOKEN}|{SHORT_TOKEN})                       { return TOKEN;}
> 
> Why not define TOKEN in the first place like this:
> {TOKENFRONT}({TOKENMID}{TOKENBACK})? and TOKENMID with a *
> instead of a + in the end?

As best I can tell, your suggestions add up to "SHORT_TOKEN isn't
necessary. A few changes to TOKEN can eliminate it."  Even if that's
not exactly what you're thinking, I've eliminated SHORT_TOKEN without
breaking "make check".

With the suggested changes to TOKEN and TOKENMID, it seems that TOKEN
works fine wherever TOKEN_12 is used, i.e. that T12 and TOKEN_12 can
be eliminated.  Right?
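
One way to sanity-check this outside of "make check" is a brute-force
comparison.  The sketch below assumes simplified, hypothetical
character classes for TOKENFRONT, TOKENMID and TOKENBACK (the real
ones in lexer_v3.l differ), so it shows the shape of the argument
rather than a proof about the actual lexer.

```python
import itertools
import re
import string

# Hypothetical, simplified character classes (not the real definitions).
FRONT = r"[A-Za-z]"
MID = r"[A-Za-z0-9.\-]*"    # '*' instead of '+', as suggested above
BACK = r"[A-Za-z0-9]"

# TOKEN restructured as {TOKENFRONT}({TOKENMID}{TOKENBACK})?
TOKEN = re.compile(f"{FRONT}(?:{MID}{BACK})?")
T12 = re.compile(r"[A-Za-z][A-Za-z0-9]?")

alnum = string.ascii_letters + string.digits

# Every 1- and 2-character alphanumeric string.
candidates = list(alnum) + ["".join(p) for p in itertools.product(alnum, repeat=2)]

# Strings accepted by T12 but rejected by the restructured TOKEN.
missed = [s for s in candidates if T12.fullmatch(s) and not TOKEN.fullmatch(s)]

print("T12 strings not matched by TOKEN:", len(missed))
```

With these stand-in classes nothing is missed, since a single letter
matches TOKENFRONT alone and a two-character string matches
TOKENFRONT, an empty TOKENMID, and TOKENBACK.  Whether the same holds
for the real classes depends on TOKENBACK accepting everything
[[:alnum:]] does.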

> :< \${NUM}(\.{NUM})?          { return TOKEN;}   /* Dollars and cents */
> :> \${NUM}(\.{NUM})?          { return MONEY;}   /* Dollars and cents */
> 
> What is the new return code good for? But anyhow, for me
> those would be normal tokens;-)

File token.c had some special processing to allow 2 character money
tokens, i.e. "$1", "$2", etc.  The MONEY code allows a cleaner
implementation of this special case.
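
To show why a separate return code helps, here is a minimal sketch of
the idea, not the code in token.c.  The minimum length of 3, the
function names, and the simplified word pattern are all assumptions
made for illustration; NUM is assumed to be a run of digits.

```python
import re

MIN_TOKEN_LEN = 3  # hypothetical minimum length for ordinary tokens

# \${NUM}(\.{NUM})? with NUM assumed to be [0-9]+
MONEY_RE = re.compile(r"\$[0-9]+(?:\.[0-9]+)?")
# Simplified stand-in for ordinary tokens (not the real TOKEN pattern).
WORD_RE = re.compile(r"[A-Za-z][A-Za-z0-9]*")


def classify(text):
    """Return 'MONEY', 'TOKEN', or None, mimicking the lexer's return codes."""
    if MONEY_RE.fullmatch(text):
        return "MONEY"
    if WORD_RE.fullmatch(text):
        return "TOKEN"
    return None


def keep(text):
    """Decide whether a token is kept; MONEY is exempt from the length check."""
    kind = classify(text)
    if kind == "MONEY":
        return True
    if kind == "TOKEN":
        return len(text) >= MIN_TOKEN_LEN
    return False


for t in ["$1", "$2.50", "ab", "abc"]:
    print(t, classify(t), keep(t))
```

With the token type available, the length check no longer needs to
inspect the text for a leading "$"; it only looks at the type.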

I've attached a patch file with the changes from 1.1.1 to current cvs
for lexer_v3.l and token.c.  If you have further improvements (that
don't break "make check"), I'm all ears.

Enjoy!

David
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: patch.1126.parsing.txt
URL: <http://www.bogofilter.org/pipermail/bogofilter/attachments/20061126/1744904c/attachment.txt>

