' or ` at TOKENBACK (was: lexer change)

Mon Nov 17 13:13:34 CET 2003

On Mon, 17 Nov 2003 12:09:16 +0100
Boris 'pi' Piwinger <3.14 at logic.univie.ac.at> wrote:

> Boris 'pi' Piwinger wrote:
> 
> > Here is once more pretty much what a token is:
> >> TOKENFRONT	[^[:blank:][:cntrl:][:digit:][:punct:]]
> >> TOKENMID	[^[:blank:]<>;=():&%$#@+|/\\{}^\"?*,[:cntrl:][\]]*
> >> TOKENBACK	[^[:blank:]<>;=():&%$#@+|/\\{}^\"?*,[:cntrl:][\]._+-]
> > 
> > I'll even have to look up which of these listed characters
> > are not in [:punct:], which AFAICS is this list: ! " # $ % & '
> > ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
> > (http://www.gnu.org/software/grep/doc/grep_8.html)
> 
> > BTW: There is a + doubled in TOKENBACK.
> 
> AFAICS it still is.

Flex turns the character list into a set, so changing the order and
adding or removing duplicates doesn't (ultimately) change what is matched.
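
To illustrate the point, here is a rough Python sketch (not flex itself).
The ASCII stand-ins for [:blank:] and [:cntrl:] and the helper names
negated_class() and accepted() are assumptions made for the example. It
builds the TOKENBACK class quoted above with and without the doubled +
and compares which ASCII characters each one accepts:

```python
import re

# Assumed ASCII stand-ins for the POSIX classes used in the flex source.
BLANK = " \t"
CNTRL = "".join(chr(c) for c in range(32)) + "\x7f"

def negated_class(excluded):
    """Compile a regex matching one character that is NOT in `excluded`."""
    return re.compile("[^" + "".join(re.escape(c) for c in excluded) + "]")

def accepted(rx):
    """Return the set of ASCII characters the class accepts."""
    return {chr(c) for c in range(128) if rx.fullmatch(chr(c))}

# TOKENBACK as quoted (with the doubled +) and with the duplicate removed.
with_dup = negated_class(BLANK + '<>;=():&%$#@+|/\\{}^"?*,' + CNTRL + "[]._+-")
without_dup = negated_class(BLANK + '<>;=():&%$#@+|/\\{}^"?*,' + CNTRL + "[]._-")

same = accepted(with_dup) == accepted(without_dup)
print(same)
```

Both classes accept exactly the same characters, so the doubled + is
harmless, if untidy.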

> > At the end of a word we only allow !'` in addition to those
> > allowed at the front. I cannot say why ' or ` should be
> > there.
> 
> Nobody took that up. With the latest CVS lexer I get this:
> $ echo test\' |bogolexer
> normal mode.
> get_token: 1 "head:test'"
> 1 tokens read.
> $ echo test\` |bogolexer
> normal mode.
> get_token: 1 "head:test`"
> 1 tokens read.

As you say, they were probably overlooked.  

BTW, the output can be made cleaner.  Try "echo test\' test\` |
bogolexer -p -PH"

> This is what you would expect from the definition of
> TOKENBACK. I believe this has just been overlooked. The
> allowed ! is in a comment explaining we want it; those two
> are not mentioned, so they probably just got lost when
> expanding [:punct:] in TOKENBACK.
> 
> I cannot see any reason why we should accept those two
> characters at the end of a token. So I'd remove them.
> 
> For better readability, I'd also move the [:cntrl:] to the
> beginning and do some reordering, also expand [:punct:] for
> comparability, like this:
> > TOKENFRONT	[^[:blank:][:cntrl:]<>;&%@|/\\{}^"*,[\]?=():#+$._'`~![:digit:]-]
> > TOKENBACK	[^[:blank:][:cntrl:]<>;&%@|/\\{}^"*,[\]?=():#+$._'`~-]
> > TOKENMID	[^[:blank:][:cntrl:]<>;&%@|/\\{}^"*,[\]?=():#+]+
> > BOGOLEX_TOKEN	[^[:blank:][:cntrl:]<>;&%@|/\\{}^"*,[\]]+
> 
> Please double-check since I backported this from my new
> version of the lexer.
> 
> This breaks some tests (t.lexer.mbx and t.maint); t.systest
> was already broken for me before.

I've made the suggested changes to my copy of lexer_v3.l.  This evening
I'll have time to run "make check" and look at what's changed.  If the
results seem reasonable, I'll commit the lexer change and update the
test results.
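
Independent of "make check", the two TOKENBACK definitions can be compared
outside flex. The following is only a Python approximation: [:blank:] and
[:cntrl:] are assumed to be their ASCII equivalents, and negated_class()
and allowed_punct() are hypothetical helper names. It lists which
punctuation characters each version of TOKENBACK would accept:

```python
import re
import string

# Assumed ASCII stand-ins for the POSIX classes used in the flex source.
BLANK = " \t"
CNTRL = "".join(map(chr, range(32))) + "\x7f"

def negated_class(excluded):
    """Compile a regex matching one character that is NOT in `excluded`."""
    return re.compile("[^" + "".join(re.escape(c) for c in excluded) + "]")

# Current TOKENBACK and the proposed one, transcribed from the definitions above.
OLD_BACK = negated_class(BLANK + CNTRL + '<>;=():&%$#@+|/\\{}^"?*,[]._+-')
NEW_BACK = negated_class(BLANK + CNTRL + '<>;&%@|/\\{}^"*,[]?=():#+$._\'`~-')

def allowed_punct(rx):
    """Return the ASCII punctuation characters the class accepts."""
    return "".join(c for c in string.punctuation if rx.fullmatch(c))

print("old TOKENBACK allows:", allowed_punct(OLD_BACK))
print("new TOKENBACK allows:", allowed_punct(NEW_BACK))
```

Under these assumptions the old class lets ' and ` through at the end of
a token (and ~ as well), while the proposed class leaves only !.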

David