Understanding lexer_v3.l changes
David Relson
relson at osagesoftware.com
Sun Nov 26 17:49:05 CET 2006
On Sun, 26 Nov 2006 16:47:35 +0100 Boris 'pi' Piwinger wrote:
> Hi!
>
> I'm just trying to understand the recent changes in lexer_v3.l:
>
> :< /* $Id: lexer_v3.l,v 1.162 2005/06/27 00:40:48 relson Exp $ */
> :> /* $Id: lexer_v3.l,v 1.167 2006/07/04 03:47:37 relson Exp $ */
>
> So this is 1.0.3 vs 1.1.1
>
> :< ID <?[[:alnum:]-]*>?
> :> ID <?[[:alnum:]\-\.]*>?
>
> What is the new dot good for? CVS has "Cleanup queue-id
> processing." as a comment. I am not sure what it relates to,
> but the long comment at the beginning of lexer_v3.l says
> something about avoiding dots.
It allows dots within IDs.
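To make the change concrete, here is a sketch in Python's re module (not flex, and with the POSIX [[:alnum:]] class approximated as ASCII) comparing the old and new ID patterns against a hypothetical queue id containing a dot:

```python
import re

# Old pattern: <?[[:alnum:]-]*>?  -- no dot in the class
old_id = re.compile(r'<?[A-Za-z0-9-]*>?')
# New pattern: <?[[:alnum:]\-\.]*>?  -- dot added to the class
new_id = re.compile(r'<?[A-Za-z0-9.-]*>?')

queue_id = "<kA3JGj4x024334.mail>"  # hypothetical id with an embedded dot

print(bool(old_id.fullmatch(queue_id)))  # False: dot stops the match
print(bool(new_id.fullmatch(queue_id)))  # True: dot now allowed
```

The only difference is the dot inside the bracket expression; everything else behaves as before.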
> :> SHORT_TOKEN {TOKENFRONT}{TOKENBACK}?
> :> T1 [[:alpha:]]
> :< TOKEN_12 ({TOKEN}|{T12})
> :> TOKEN_12 ({TOKEN}|{T12}|{T1})
>
> We now have:
> T1 [[:alpha:]]
> T12 [[:alpha:]][[:alnum:]]?
> TOKEN_12 ({TOKEN}|{T12}|{T1})
>
> If I am not totally wrong, a string matching T1 will also
> match T12, so we could simply drop the new addition.
You are correct. Removing T1 does not affect "make check", so it'll
be removed from CVS shortly.
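The subset relation can be checked mechanically. A small Python re sketch (flex classes approximated as ASCII; not the actual lexer):

```python
import re

# T1  [[:alpha:]]
t1 = re.compile(r'[A-Za-z]')
# T12 [[:alpha:]][[:alnum:]]?  -- second character is optional
t12 = re.compile(r'[A-Za-z][A-Za-z0-9]?')

# Every string fully matched by T1 is also fully matched by T12,
# because T12's trailing [[:alnum:]] is optional.
singles = [c for c in map(chr, range(128)) if t1.fullmatch(c)]
print(all(t12.fullmatch(s) for s in singles))  # True: T1 is redundant
```

Since T12 already covers every T1 match, ({TOKEN}|{T12}|{T1}) and ({TOKEN}|{T12}) accept the same strings.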
> BTW, what was the reason, that TOKEN is not allowed to start
> with one digit, but may contain digits inside?
This makes "A123" a valid token while "1234" is not a valid
token. Allowing tokens that are totally numeric would be a
bad thing, no?
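A simplified illustration in Python re (assumption: TOKENFRONT reduced to an ASCII letter and the rest of the token to alphanumerics, which is narrower than the real flex classes):

```python
import re

# TOKENFRONT excludes digits (and blanks, control chars, punctuation),
# so a token may contain digits but cannot begin with one.
token = re.compile(r'[A-Za-z][A-Za-z0-9]*')

print(bool(token.fullmatch("A123")))  # True: digits allowed inside
print(bool(token.fullmatch("1234")))  # False: purely numeric, rejected
```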
> :< old: ENCODED_WORD =\?{CHARSET}\?(b\?{BASE64}|q\?{QP})\?=
> :> old: ENCODED_WORD =\?{CHARSET}\?(b\?{BASE64}\|q\?{QP})\?=
> :< HTML_WO_COMMENTS "<"[^!][^>]*">"|"<>"
> :> HTML_WO_COMMENTS "<"[^!][^>]*">"\|"<>"
>
> Pure make-up.
>
> :< <HTOKEN>{TOKEN}              { return TOKEN; }
> :> <HTOKEN>({TOKEN}|{SHORT_TOKEN}) { return TOKEN; }
> :< {TOKEN}                      { return TOKEN;}
> :> ({TOKEN}|{SHORT_TOKEN})      { return TOKEN;}
>
> Why not define TOKEN in the first place like this:
> {TOKENFRONT}({TOKENMID}{TOKENBACK})? and TOKENMID with a *
> instead of a + in the end?
As best I can tell, your suggestions add up to "SHORT_TOKEN isn't
necessary. A few changes to TOKEN can eliminate it." Even if that's
not exactly what you're thinking, I've eliminated SHORT_TOKEN without
breaking "make check".
With the suggested changes to TOKEN and TOKENMID, it seems that TOKEN
works fine wherever TOKEN_12 is used, i.e. that T12 and TOKEN_12 can
be eliminated. Right?
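The suggested redefinition can be checked with a Python re sketch (ASCII stand-ins for the flex classes, which are an assumption; the real classes are wider):

```python
import re

front = r'[A-Za-z]'         # stand-in for TOKENFRONT
mid   = r'[A-Za-z0-9._-]*'  # stand-in for TOKENMID, with '*' instead of '+'
back  = r'[A-Za-z0-9]'      # stand-in for TOKENBACK

# TOKEN redefined as {TOKENFRONT}({TOKENMID}{TOKENBACK})?
token = re.compile(f'{front}(?:{mid}{back})?')

# Now one pattern covers length-1, length-2, and longer tokens,
# which is exactly what TOKEN|SHORT_TOKEN used to cover.
print(all(token.fullmatch(s) for s in ("a", "ab", "abc", "a.b-c9")))  # True
```

With the middle made optional and TOKENMID allowed to be empty, SHORT_TOKEN ({TOKENFRONT}{TOKENBACK}?) falls out as a special case.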
> :< \${NUM}(\.{NUM})?  { return TOKEN;}  /* Dollars and cents */
> :> \${NUM}(\.{NUM})?  { return MONEY;}  /* Dollars and cents */
>
> What is the new return code good for? But anyhow, for me
> those would be normal tokens;-)
File token.c had some special processing to allow 2 character money
tokens, i.e. "$1", "$2", etc. The MONEY code allows a cleaner
implementation of this special case.
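For reference, the money pattern itself is straightforward; a Python re sketch (assuming NUM is a digit string):

```python
import re

num = r'[0-9]+'  # stand-in for NUM
# \${NUM}(\.{NUM})?  -- dollars with optional cents
money = re.compile(rf'\${num}(?:\.{num})?')

print(bool(money.fullmatch("$1")))      # True: the 2-character case
print(bool(money.fullmatch("$19.95")))  # True: dollars and cents
print(bool(money.fullmatch("$")))       # False: digits required
```

Matching these with a dedicated MONEY code lets token.c handle the short "$1"-style tokens without a special-case length check.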
I've attached a patch file with the changes from 1.1.1 to current cvs
for lexer_v3.l and token.c. If you have further improvements (that
don't break "make check"), I'm all ears.
Enjoy!
David
-------------- next part --------------
diff -u -r --exclude-from=diff.excl 111/src/lexer_v3.l cvs/src/lexer_v3.l
--- 111/src/lexer_v3.l 2006-07-03 23:47:37.000000000 -0400
+++ cvs/src/lexer_v3.l 2006-11-26 11:40:21.000000000 -0500
@@ -1,4 +1,4 @@
-/* $Id: lexer_v3.l,v 1.167 2006/07/04 03:47:37 relson Exp $ */
+/* $Id: lexer_v3.l,v 1.170 2006/11/26 16:38:07 relson Exp $ */
%{
/*
@@ -15,7 +15,6 @@
* We throw away headers that are readily identifiable as dates.
* We throw away all digit strings that don't look like IP address parts.
* We thow away lines beginning with <tab>id<space> -- mailer UDs.
- * We throw away *all* tokens of length 1 or 2.
*
* These are optimizations to keep the token lists from bloating.
* The big win is recognizing machine-generated unique IDs that
@@ -137,7 +136,7 @@
BCHARS [[:alnum:]()+_,-./:=?#\' ]
MIME_BOUNDARY {BCHARS}*{BCHARSNOSPC}
-ID <?[[:alnum:]\-\.]*>?
+ID <?[[:alnum:]\-\.]+>?
CHARSET [[:alnum:]-]+
VERPID [[:alnum:]#-]+[[:digit:]]+[[:alnum:]#-]+
MTYPE [[:blank:]]*[[:alnum:]/-]*
@@ -147,16 +146,11 @@
MSG_COUNT ^\".MSG_COUNT\"
TOKENFRONT [^[:blank:][:cntrl:][:digit:][:punct:]]
-TOKENMID [^[:blank:][:cntrl:]<>;=():&%$#@+|/\\{}^\"?*,\[\]]+
+TOKENMID [^[:blank:][:cntrl:]<>;=():&%$#@+|/\\{}^\"?*,\[\]]*
BOGOLEX_TOKEN [^[:blank:][:cntrl:]<>; &% @ |/\\{}^\" *,\[\]]+
TOKENBACK [^[:blank:][:cntrl:]<>;=():&%$#@+|/\\{}^\"?*,\[\]._~\'\`\-]
-TOKEN {TOKENFRONT}{TOKENMID}{TOKENBACK}
-SHORT_TOKEN {TOKENFRONT}{TOKENBACK}?
-
-T1 [[:alpha:]]
-T12 [[:alpha:]][[:alnum:]]?
-TOKEN_12 ({TOKEN}|{T12}|{T1})
+TOKEN {TOKENFRONT}({TOKENMID}{TOKENBACK})?
/* RFC2047.2
encoded-word = "=?" charset "?" encoding "?" encoded-text "?="
@@ -252,7 +246,7 @@
<INITIAL>charset=\"?{CHARSET}\"? { got_charset(yytext); skip_to('='); return TOKEN; }
<INITIAL>(file)?name=\"? /* ignore */
-<INITIAL>\n?[[:blank:]]id\ {ID} { return QUEUE_ID; }
+<INITIAL>\n?[[:blank:]]id{WHITESPACE}+{ID} { return QUEUE_ID; }
<INITIAL>\n[[:blank:]] { lineno += 1; }
<INITIAL>\n\n { enum mimetype type = get_content_type();
@@ -295,14 +289,14 @@
return TOKEN;
}
-<HTML>{TOKEN_12}({HTMLTOKEN})+/{NOTWHITESPACE} { html_reorder(); }
+<HTML>{TOKEN}({HTMLTOKEN})+/{NOTWHITESPACE} { html_reorder(); }
<HTML>"<!--" { BEGIN SCOMMENT; }
<HTML>"<!" { BEGIN LCOMMENT; }
<HTML>"<"(a|img|font){WHITESPACE} { BEGIN HTOKEN; }
<HTML>"<" { BEGIN HDISCARD; } /* unknown tag */
-<HTOKEN>({TOKEN}|{SHORT_TOKEN}) { return TOKEN; }
+<HTOKEN>({TOKEN}) { return TOKEN; }
<HDISCARD,LCOMMENT,SCOMMENT>{TOKEN} { /* discard innards of html tokens and comments */ }
<HTOKEN,HDISCARD,LCOMMENT>">" { BEGIN HTML; } /* end of tag, loose comment; return to normal html processing */
@@ -312,9 +306,9 @@
{IPADDR} { return IPADDR;}
"\["({IPADDR})"\]" { return MESSAGE_ADDR;}
-({TOKEN}|{SHORT_TOKEN}) { return TOKEN;}
+({TOKEN}) { return TOKEN;}
-<HTML>{TOKEN_12}?{HTML_ENCODING} { html_char(); } /* process escaped chars, eg 'e' is 'a' */
+<HTML>{TOKEN}?{HTML_ENCODING} { html_char(); } /* process escaped chars, eg 'e' is 'a' */
<HTOKEN>"/"[^/[:blank:]\n%]*{URL_ENCODING}+ { url_char(); } /* process escaped chars, eg '%61' is 'a' */
\${NUM}(\.{NUM})? { return MONEY;} /* Dollars and cents */
diff -u -r --exclude-from=diff.excl 111/src/token.c cvs/src/token.c
--- 111/src/token.c 2006-08-10 21:43:59.000000000 -0400
+++ cvs/src/token.c 2006-11-26 11:34:30.000000000 -0500
@@ -1,4 +1,4 @@
-/* $Id: token.c,v 1.151 2006/08/11 01:43:59 relson Exp $ */
+/* $Id: token.c,v 1.153 2006/11/26 16:34:30 relson Exp $ */
/*****************************************************************************
@@ -591,8 +591,8 @@
w_recv = word_news("rcvd:"); /* Received: */
w_head = word_news("head:"); /* Header: */
w_mime = word_news("mime:"); /* Mime: */
- w_ip = word_news("ip"); /* ip: */
- w_url = word_news("url"); /* url: */
+ w_ip = word_news("ip:"); /* ip: */
+ w_url = word_news("url:"); /* url: */
nonblank_line = word_news(NONBLANK);
/* do multi-word token initializations */