Short tokens and numbers

Tue Nov 4 12:57:17 CET 2003

Boris 'pi' Piwinger wrote:

>>> >>Maybe someone can explain why we use (my version):
>>> >>TOKEN		{TOKENFRONT}{TOKENMID}{TOKENBACK}{0,70}
>>> >>instead of
>>> >>TOKEN		{TOKENFRONT}{TOKENMID}{0,70}{TOKENBACK}
>>> >>where we had to modify TOKENMID (remove *) and TOKENBACK
>>> >>(add  ?) appropriately.
>>> 
>>> Any answer for this?
>>
>>As I interpret the TOKENMID and TOKENBACK patterns, the first limits
>>what's allowed as the first character while the second defines what's
>>permitted in the middle and end positions.  Perhaps they should be named
>>TOKEN_FIRST and TOKEN_REST (or TOKEN_HEAD and TOKEN_TAIL).
> 
> Obviously (ignoring the quantifiers) TOKENBACK is a proper
> subset of TOKENMID (namely not allowing ._-+; the order is
> changed, which is irritating but of course not
> important). I don't know which characters need to be escaped
> with \.

Well, if I take the backslashes away it still builds and
passes all tests. I also cannot find a difference when I run
it over *all* my mails. So I assume my changes are correct.
I'm also cleaning up the order to make it easier to read.
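
A small check of the escaping question, as a sketch only: it
uses Python's re module rather than flex, on the assumption
that both treat ordinary punctuation inside a bracket
expression literally. The four characters tested are just
examples taken from the patterns above.

```python
import re
import string

# Same bracket expression, once with backslashes and once without.
escaped = re.compile(r"[\.\*\?\+]")
plain = re.compile(r"[.*?+]")


def matched_chars(pattern):
    """Return the set of printable ASCII characters the class accepts."""
    return {c for c in string.printable if pattern.fullmatch(c)}


same = matched_chars(escaped) == matched_chars(plain)
print(same, sorted(matched_chars(plain)))
```

Inside brackets the two spellings accept the same characters
here, which matches the observation that removing the
backslashes changes nothing in the tests.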

> But: I just try to understand the rationale why this way.
>
> Let me try to describe what we do: We start with TOKENFRONT
> (one character). Then any number of TOKENMID followed by up
> to 70 characters of TOKENBACK. If quantifiers are greedy in
> that language, then we actually never use TOKENBACK (in the
> standard version exactly one character here).
[...]
> So here is my idea what it should have been:
> 
> :TOKENFRONT	[^[:blank:][:cntrl:][:digit:][:punct:]]
> :TOKENMID	[^[:blank:]<>;=():&%$#@+|/\\{}^\"\?\*,[:cntrl:]\[\]]
> :TOKENBACK	[^[:blank:]<>;=():&%$#@+|/\\{}^\"\?\*\._\-\+,\[\][:cntrl:]]
> : 
> :TOKEN		{TOKENFRONT}{TOKENMID}{1,70}{TOKENBACK}

So I'm also testing this. It does not work, and I have no
clue why; maybe someone can explain. I cannot put {1,70} at
this place nor at the end of TOKENMID: flex just stalls.

If I remove {1,70} and leave the + at the end of TOKENMID it
works though. This is the situation I described above, where
the range would never be used.
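
To illustrate why the {1,70} range has no effect, here is a
minimal sketch in Python's re module (not flex). FRONT, MID
and BACK are simplified ASCII stand-ins that I am assuming for
the example, not the real POSIX classes; the only property
that matters is that BACK is a subset of MID. Because of that,
every string in FRONT MID+ BACK{1,70} is also in
FRONT MID+ BACK and vice versa, so the result does not depend
on greedy matching.

```python
import re

# Simplified stand-ins (assumptions): BACK must be a subset of MID.
FRONT = r"[A-Za-z]"
MID = r"[A-Za-z0-9._-]"
BACK = r"[A-Za-z0-9]"

# Shape of the original definition and of the first patch.
original = re.compile(FRONT + MID + "+" + BACK + "{1,70}")
patched = re.compile(FRONT + MID + "+" + BACK)

# The range does not cap the length: a 200-character word still matches.
long_word = "a" * 200
long_match_len = len(original.fullmatch(long_word).group())

# Both patterns agree on every sample, accepted or rejected.
samples = ["ab", "abc", "a.b", "a.", "a..", "foo-bar", long_word]
agree = all(
    (original.fullmatch(s) is None) == (patched.fullmatch(s) is None)
    for s in samples
)
print(long_match_len, agree)
```

Note that both forms need at least three characters, so "ab"
is rejected by either one.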

Still, all my mails are treated exactly the same. So this
looks much more like what is intended, but cleaner and
easier to read. This is the first attached patch.


Also, I found that my previous patch, which would allow
tokens of length 1 and 2 plus numbers (and tokens starting
with numbers), has a bug: it allows a word of the form
{TOKENFRONT}{TOKENMID}, which is clearly unwanted.

Looking at my database I note that only a single token of
length 1 is used in the calculations (of course, this
depends on my personal settings); namely, Q is hammish. So it
really doesn't seem to be worth adding those. But tokens of
length 2 do make a difference. So my second patch will still
allow numbers (remember the discussion about phone
numbers; prices might also be interesting) and tokens of
length 2, while dropping those of length 1 and fixing the
bug. This patch applies to the original lexer_v3.l again, so
the two patches are incompatible. Note that make check will
fail with the second patch.
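
A rough sketch of what the second patch's TOKEN shape accepts,
again in Python's re module with simplified ASCII stand-ins
that I am assuming for illustration (the real classes are
wider). It looks at the TOKEN pattern in isolation only; other
lexer rules may still claim some of these inputs first.

```python
import re

# Simplified stand-ins (assumptions). FRONT now admits digits, because
# the second patch drops [:digit:] from the excluded classes.
FRONT = r"[A-Za-z0-9]"
MID = r"[A-Za-z0-9._-]"
BACK = r"[A-Za-z0-9]"

# Second patch: TOKENMID becomes optional (*), TOKENBACK is required once.
token = re.compile(FRONT + MID + "*" + BACK)


def is_token(text):
    """True if the whole text matches the sketched TOKEN pattern."""
    return token.fullmatch(text) is not None


samples = ["Q", "ab", "42", "19.99", "555-1234", "a.", "3."]
results = {s: is_token(s) for s in samples}
print(results)
```

Length-1 words such as "Q" are rejected, length-2 words and
numbers pass, and a word ending in a TOKENMID-only character
("a.", the {TOKENFRONT}{TOKENMID} form) is rejected as well.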

> Further, we would allow a sequence of any length, didn't we
> want to limit that?

In the standard 0.15.8 build:

Something looks very broken. The following is not a token:
abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz
Why not? It has only 52 characters.
abcdefghijklmnopqrstuvwxyzabcde is not a token either, while
abcdefghijklmnopqrstuvwxyzabcd is. So something in
another place is blocking us. This looks inconsistent.
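
Counting the characters in those three words makes the
boundary explicit. This is only arithmetic on the observations
reported above; where the limit actually comes from is not
established here.

```python
alphabet = "abcdefghijklmnopqrstuvwxyz"

# Word -> whether it was reported as a token in the 0.15.8 build.
observed = {
    alphabet + alphabet: False,
    alphabet + "abcde": False,
    alphabet + "abcd": True,
}

longest_accepted = max(len(w) for w, ok in observed.items() if ok)
shortest_rejected = min(len(w) for w, ok in observed.items() if not ok)
print(longest_accepted, shortest_rejected)
```

The accepted word has 30 characters and the shortest rejected
one has 31, which would be consistent with a separate cap of
30 applied somewhere outside the TOKEN pattern.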

pi
-------------- next part --------------

--- lexer_v3.l.bak	Tue Nov  4 10:38:23 2003
+++ lexer_v3.l	Tue Nov  4 11:24:34 2003
@@ -133,12 +133,12 @@
 NUM_NUM		\ [0-9]+\ [0-9]+
 MSG_COUNT	^\"\.MSG_COUNT\"
 
+BOGOLEX_TOKEN	[^[:blank:]<>;    &%  @ |/\\{}^" *,[:cntrl:][\]]+
 TOKENFRONT	[^[:blank:][:cntrl:][:digit:][:punct:]]
-TOKENMID	[^[:blank:]<>;=():&%$#@+|/\\{}^\"\?\*,[:cntrl:]\[\]]+
-BOGOLEX_TOKEN	[^[:blank:]<>;    &%  @ |/\\{}^\"  \*,[:cntrl:]\[\]]+
-TOKENBACK	[^[:blank:]<>;=():&%$#@+|/\\{}^\"\?\*\._\-\+,\[\][:cntrl:]]
+TOKENMID	[^[:blank:]<>;=():&%$#@+|/\\{}^"?*,[:cntrl:][\]]+
+TOKENBACK	[^[:blank:]<>;=():&%$#@+|/\\{}^"?*,[:cntrl:][\]._+-]
 
-TOKEN		{TOKENFRONT}{TOKENMID}{TOKENBACK}{1,70}
+TOKEN		{TOKENFRONT}{TOKENMID}{TOKENBACK}
 TOKEN_12 	({TOKEN}|{A2}|{A1})
 
 BASE64		[0-9a-zA-Z/+=]+
-------------- next part --------------
--- lexer_v3.l.bak	Tue Nov  4 10:38:23 2003
+++ lexer_v3.l	Tue Nov  4 11:39:55 2003
@@ -133,12 +133,12 @@
 NUM_NUM		\ [0-9]+\ [0-9]+
 MSG_COUNT	^\"\.MSG_COUNT\"
 
-TOKENFRONT	[^[:blank:][:cntrl:][:digit:][:punct:]]
-TOKENMID	[^[:blank:]<>;=():&%$#@+|/\\{}^\"\?\*,[:cntrl:]\[\]]+
-BOGOLEX_TOKEN	[^[:blank:]<>;    &%  @ |/\\{}^\"  \*,[:cntrl:]\[\]]+
-TOKENBACK	[^[:blank:]<>;=():&%$#@+|/\\{}^\"\?\*\._\-\+,\[\][:cntrl:]]
+BOGOLEX_TOKEN	[^[:blank:]<>;    &%  @ |/\\{}^" *,[:cntrl:][\]]+
+TOKENFRONT	[^[:blank:][:cntrl:][:punct:]]
+TOKENMID	[^[:blank:]<>;=():&%$#@+|/\\{}^"?*,[:cntrl:][\]]*
+TOKENBACK	[^[:blank:]<>;=():&%$#@+|/\\{}^"?*,[:cntrl:][\]._+-]
 
-TOKEN		{TOKENFRONT}{TOKENMID}{TOKENBACK}{1,70}
+TOKEN		{TOKENFRONT}{TOKENMID}{TOKENBACK}
 TOKEN_12 	({TOKEN}|{A2}|{A1})
 
 BASE64		[0-9a-zA-Z/+=]+