Understanding lexer_v3.l changes

David Relson relson at osagesoftware.com
Sun Nov 26 17:49:05 CET 2006


On Sun, 26 Nov 2006 16:47:35 +0100 Boris 'pi' Piwinger wrote:

> Hi!
> 
> I'm just trying to understand the recent changes in lexer_v3.l:
> 
> :< /* $Id: lexer_v3.l,v 1.162 2005/06/27 00:40:48 relson Exp $ */
> :> /* $Id: lexer_v3.l,v 1.167 2006/07/04 03:47:37 relson Exp $ */
> 
> So this is 1.0.3 vs 1.1.1
> 
> :< ID       <?[[:alnum:]-]*>?
> :> ID       <?[[:alnum:]\-\.]*>?
> 
> What is the new dot good for? CVS has "Cleanup queue-id
> processing." as the commit comment. I'm not sure what it relates
> to, but the long comment at the beginning of lexer_v3.l says
> something about avoiding dots.

It allows dots within IDs.
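
For a (made-up) Received: fragment like "id 1GnXa1-0003xy.A1", the
old pattern stops at the dot and the ".A1" tail gets tokenized
separately; with the dot added, the whole id is consumed as one
QUEUE_ID token:

    old: ID	<?[[:alnum:]-]*>?	matches "1GnXa1-0003xy", leaves ".A1"
    new: ID	<?[[:alnum:]\-\.]*>?	matches "1GnXa1-0003xy.A1"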

> :> SHORT_TOKEN   {TOKENFRONT}{TOKENBACK}?
> :> T1       [[:alpha:]]
> :< TOKEN_12      ({TOKEN}|{T12})
> :> TOKEN_12      ({TOKEN}|{T12}|{T1})
> 
> We now have: 
> T1              [[:alpha:]]
> T12             [[:alpha:]][[:alnum:]]?
> TOKEN_12        ({TOKEN}|{T12}|{T1})
> 
> If I am not totally wrong, a string matching T1 will also
> match T12, so we could simply drop the new addition.

You are correct.  Removing T1 does not affect "make check", so it'll
be removed from CVS shortly.
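
(The subsumption is easy to see from the definitions: the trailing
[[:alnum:]] in T12 is optional, so a single letter already matches it:

    T1	[[:alpha:]]		"x" matches
    T12	[[:alpha:]][[:alnum:]]?	"x" matches too, with the suffix empty

T1 can never match anything that T12 doesn't.)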

> BTW, what was the reason that TOKEN is not allowed to start
> with a digit, but may contain digits inside?

This makes "A123" a valid token while "1234" is not.  Allowing
tokens that are totally numeric would be a bad thing, no?
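
The asymmetry comes straight from the character classes: TOKENFRONT
excludes [:digit:] while TOKENMID and TOKENBACK don't, so

    "A123"	'A' = TOKENFRONT, "12" = TOKENMID, '3' = TOKENBACK	-> TOKEN
    "1234"	'1' can't match TOKENFRONT				-> no TOKEN

and an all-digit string instead falls through to the digit/IP-address
handling mentioned in the file's header comment.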

> :<   old: ENCODED_WORD =\?{CHARSET}\?(b\?{BASE64}|q\?{QP})\?=
> :>   old: ENCODED_WORD =\?{CHARSET}\?(b\?{BASE64}\|q\?{QP})\?=
> :< HTML_WO_COMMENTS      "<"[^!][^>]*">"|"<>"
> :> HTML_WO_COMMENTS      "<"[^!][^>]*">"\|"<>"
> 
> Purely cosmetic.
> 
> :< <HTOKEN>{TOKEN}                                    { return TOKEN; }
> :> <HTOKEN>({TOKEN}|{SHORT_TOKEN})                    { return TOKEN; }
> :< {TOKEN}                                            { return TOKEN;}
> :> ({TOKEN}|{SHORT_TOKEN})                            { return TOKEN;}
> 
> Why not define TOKEN like this in the first place:
> {TOKENFRONT}({TOKENMID}{TOKENBACK})?
> with TOKENMID ending in * instead of +?

As best I can tell, your suggestions add up to "SHORT_TOKEN isn't
necessary. A few changes to TOKEN can eliminate it."  Even if that's
not exactly what you're thinking, I've eliminated SHORT_TOKEN without
breaking "make check".

With the suggested changes to TOKEN and TOKENMID, it seems that TOKEN
works fine wherever TOKEN_12 is used, i.e. that T12 and TOKEN_12 can
be eliminated.  Right?
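
For reference, the consolidated definitions (as in the patch below)
cover the old cases by match length:

    TOKENMID	[^...]*				(was +, so it may be empty)
    TOKEN	{TOKENFRONT}({TOKENMID}{TOKENBACK})?

    length 1 : {TOKENFRONT}				(old SHORT_TOKEN, T1)
    length 2 : {TOKENFRONT}{TOKENBACK}			(old SHORT_TOKEN, T12)
    length 3+: {TOKENFRONT}{TOKENMID}{TOKENBACK}	(old TOKEN)

(The FRONT/BACK classes are broader than T12's [[:alpha:]][[:alnum:]]?,
so the new TOKEN accepts everything TOKEN_12 did, and a bit more.)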

> :< \${NUM}(\.{NUM})?                          { return TOKEN;}        /* Dollars and cents */
> :> \${NUM}(\.{NUM})?                          { return MONEY;}        /* Dollars and cents */
> 
> What is the new return code good for? But anyhow, for me
> those would be normal tokens;-)

File token.c had some special processing to allow 2-character money
tokens, i.e. "$1", "$2", etc.  The MONEY code allows a cleaner
implementation of this special case.
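
Roughly like this -- a sketch only, with made-up names (the MONEY
handling itself isn't in the hunks below):

    #include <stdbool.h>
    #include <string.h>

    /* sketch, not the literal token.c code: a MONEY class lets
     * short money tokens bypass the minimum length that normally
     * discards 1- and 2-character tokens */
    enum tok_class { TC_TOKEN, TC_MONEY };

    #define MIN_TOKEN_LEN 3		/* assumed default */

    static bool keep_token(enum tok_class cls, const char *text)
    {
        return cls == TC_MONEY		/* "$1", "$2", ... kept */
            || strlen(text) >= MIN_TOKEN_LEN;
    }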

I've attached a patch file with the changes from 1.1.1 to current CVS
for lexer_v3.l and token.c.  If you have further improvements (that
don't break "make check"), I'm all ears.
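
(Besides the MONEY cleanup, the token.c hunk also adds the missing
colons to the "ip" and "url" prefix strings, so those tokens are
stored as "ip:..." and "url:..." consistently with "head:", "mime:",
and the rest.)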

Enjoy!

David
-------------- next part --------------
diff -u -r --exclude-from=diff.excl 111/src/lexer_v3.l cvs/src/lexer_v3.l
--- 111/src/lexer_v3.l	2006-07-03 23:47:37.000000000 -0400
+++ cvs/src/lexer_v3.l	2006-11-26 11:40:21.000000000 -0500
@@ -1,4 +1,4 @@
-/* $Id: lexer_v3.l,v 1.167 2006/07/04 03:47:37 relson Exp $ */
+/* $Id: lexer_v3.l,v 1.170 2006/11/26 16:38:07 relson Exp $ */
 
 %{
 /*
@@ -15,7 +15,6 @@
  * We throw away headers that are readily identifiable as dates.
  * We throw away all digit strings that don't look like IP address parts.
 * We throw away lines beginning with <tab>id<space> -- mailer UDs.
- * We throw away *all* tokens of length 1 or 2.
  *
  * These are optimizations to keep the token lists from bloating.
  * The big win is recognizing machine-generated unique IDs that
@@ -137,7 +136,7 @@
 BCHARS		[[:alnum:]()+_,-./:=?#\' ]
 MIME_BOUNDARY	{BCHARS}*{BCHARSNOSPC}
 
-ID		<?[[:alnum:]\-\.]*>?
+ID		<?[[:alnum:]\-\.]+>?
 CHARSET		[[:alnum:]-]+
 VERPID		[[:alnum:]#-]+[[:digit:]]+[[:alnum:]#-]+
 MTYPE		[[:blank:]]*[[:alnum:]/-]*
@@ -147,16 +146,11 @@
 MSG_COUNT	^\".MSG_COUNT\"
 
 TOKENFRONT	[^[:blank:][:cntrl:][:digit:][:punct:]]
-TOKENMID	[^[:blank:][:cntrl:]<>;=():&%$#@+|/\\{}^\"?*,\[\]]+
+TOKENMID	[^[:blank:][:cntrl:]<>;=():&%$#@+|/\\{}^\"?*,\[\]]*
 BOGOLEX_TOKEN	[^[:blank:][:cntrl:]<>;    &%  @ |/\\{}^\" *,\[\]]+
 TOKENBACK	[^[:blank:][:cntrl:]<>;=():&%$#@+|/\\{}^\"?*,\[\]._~\'\`\-]
 
-TOKEN		{TOKENFRONT}{TOKENMID}{TOKENBACK}
-SHORT_TOKEN	{TOKENFRONT}{TOKENBACK}?
-
-T1		[[:alpha:]]
-T12		[[:alpha:]][[:alnum:]]?
-TOKEN_12 	({TOKEN}|{T12}|{T1})
+TOKEN		{TOKENFRONT}({TOKENMID}{TOKENBACK})?
 
 /*  RFC2047.2
     encoded-word = "=?" charset "?" encoding "?" encoded-text "?="
@@ -252,7 +246,7 @@
 <INITIAL>charset=\"?{CHARSET}\"?		{ got_charset(yytext); skip_to('='); return TOKEN; }
 
 <INITIAL>(file)?name=\"?			/* ignore */
-<INITIAL>\n?[[:blank:]]id\ {ID}			{ return QUEUE_ID; }
+<INITIAL>\n?[[:blank:]]id{WHITESPACE}+{ID}	{ return QUEUE_ID; }
 
 <INITIAL>\n[[:blank:]]				{ lineno += 1; }
 <INITIAL>\n\n					{ enum mimetype type = get_content_type();
@@ -295,14 +289,14 @@
 			return TOKEN;
 			}
 
-<HTML>{TOKEN_12}({HTMLTOKEN})+/{NOTWHITESPACE}	{ html_reorder(); }
+<HTML>{TOKEN}({HTMLTOKEN})+/{NOTWHITESPACE}	{ html_reorder(); }
 
 <HTML>"<!--"					{ BEGIN SCOMMENT; }
 <HTML>"<!"					{ BEGIN LCOMMENT; }
 <HTML>"<"(a|img|font){WHITESPACE}		{ BEGIN HTOKEN; }
 <HTML>"<"					{ BEGIN HDISCARD; }	/* unknown tag */
 
-<HTOKEN>({TOKEN}|{SHORT_TOKEN})			{ return TOKEN; }
+<HTOKEN>({TOKEN})				{ return TOKEN; }
 <HDISCARD,LCOMMENT,SCOMMENT>{TOKEN}		{ /* discard innards of html tokens and comments */ }
 
 <HTOKEN,HDISCARD,LCOMMENT>">"			{ BEGIN HTML; }	/* end of tag, loose comment; return to normal html processing */
@@ -312,9 +306,9 @@
 {IPADDR}					{ return IPADDR;}
 "\["({IPADDR})"\]"				{ return MESSAGE_ADDR;}
 
-({TOKEN}|{SHORT_TOKEN})				{ return TOKEN;}
+({TOKEN})					{ return TOKEN;}
 
-<HTML>{TOKEN_12}?{HTML_ENCODING}		{ html_char(); }	/* process escaped chars, eg '&#101;' is 'a' */
+<HTML>{TOKEN}?{HTML_ENCODING}			{ html_char(); }	/* process escaped chars, eg '&#101;' is 'a' */
 <HTOKEN>"/"[^/[:blank:]\n%]*{URL_ENCODING}+	{ url_char(); }		/* process escaped chars, eg '%61'    is 'a' */
 
 \${NUM}(\.{NUM})?				{ return MONEY;}	/* Dollars and cents */
diff -u -r --exclude-from=diff.excl 111/src/token.c cvs/src/token.c
--- 111/src/token.c	2006-08-10 21:43:59.000000000 -0400
+++ cvs/src/token.c	2006-11-26 11:34:30.000000000 -0500
@@ -1,4 +1,4 @@
-/* $Id: token.c,v 1.151 2006/08/11 01:43:59 relson Exp $ */
+/* $Id: token.c,v 1.153 2006/11/26 16:34:30 relson Exp $ */
 
 /*****************************************************************************
 
@@ -591,8 +591,8 @@
 	w_recv = word_news("rcvd:");	/* Received:    */
 	w_head = word_news("head:");	/* Header:      */
 	w_mime = word_news("mime:");	/* Mime:        */
-	w_ip   = word_news("ip");	/* ip:          */
-	w_url  = word_news("url");	/* url:         */
+	w_ip   = word_news("ip:");	/* ip:          */
+	w_url  = word_news("url:");	/* url:         */
 	nonblank_line = word_news(NONBLANK);
 
 	/* do multi-word token initializations */

