' or ` at TOKENBACK (was: lexer change)

Mon Nov 17 12:09:16 CET 2003

Boris 'pi' Piwinger wrote:

> Here is once more pretty much what a token is:
>> TOKENFRONT	[^[:blank:][:cntrl:][:digit:][:punct:]]
>> TOKENMID	[^[:blank:]<>;=():&%$#@+|/\\{}^\"?*,[:cntrl:][\]]*
>> TOKENBACK	[^[:blank:]<>;=():&%$#@+|/\\{}^\"?*,[:cntrl:][\]._+-]
> 
> I'll even have to look up which of these characters listed
> is not in [:punct:], which is AFAICS this list: ! " # $ % & '
> ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
> (http://www.gnu.org/software/grep/doc/grep_8.html)

> BTW: There is a + doubled in TOKENBACK.

AFAICS it still is.

> At the end of a word we only allow !'` in addition to those
> allowed at the front. I cannot say why ' or ` should be
> there.

Nobody took that up. With the latest CVS lexer I get this:
$ echo test\' |bogolexer
normal mode.
get_token: 1 "head:test'"
1 tokens read.
$ echo test\` |bogolexer
normal mode.
get_token: 1 "head:test`"
1 tokens read.

This is what you would expect from the definition of
TOKENBACK. I believe this has just been overlooked. The
allowed ! is mentioned in a comment explaining that we want
it; ' and ` are not mentioned there, so they were probably
just lost when [:punct:] was expanded in TOKENBACK.

I cannot see any reason why we should accept those two
characters at the end of a token. So I'd remove them.
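
To see which punctuation the current TOKENBACK lets through without
reading the bracket expression by eye, here is a minimal sketch. It
assumes a hand transcription of the literal characters from the
definition quoted above into a Python set, looks at ASCII only, and
assumes [:blank:] and [:cntrl:] contain no punctuation. The names
OLD_BACK_EXCLUDED and allowed_punct are made up for illustration and
are not part of bogofilter.

```python
import string

# Literal characters excluded by the current TOKENBACK, transcribed by
# hand from the flex definition (assumption: the transcription is exact).
# [:blank:] and [:cntrl:] are left out since they hold no punctuation.
OLD_BACK_EXCLUDED = set('<>;=():&%$#@+|/\\{}^"?*,[]._+-')


def allowed_punct(excluded):
    """Return the ASCII punctuation characters that are NOT excluded."""
    return "".join(c for c in string.punctuation if c not in excluded)


print(allowed_punct(OLD_BACK_EXCLUDED))
```

If the transcription is faithful, this prints !'`~, so ~ would be
accepted at the end of a token as well as ' and `. The rewritten
TOKENBACK further down excludes all three.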

For better readability, I'd also move [:cntrl:] to the
beginning, do some reordering, and expand [:punct:] so the
definitions are easier to compare, like this:
> TOKENFRONT	[^[:blank:][:cntrl:]<>;&%@|/\\{}^"*,[\]?=():#+$._'`~![:digit:]-]
> TOKENBACK	[^[:blank:][:cntrl:]<>;&%@|/\\{}^"*,[\]?=():#+$._'`~-]
> TOKENMID	[^[:blank:][:cntrl:]<>;&%@|/\\{}^"*,[\]?=():#+]+
> BOGOLEX_TOKEN	[^[:blank:][:cntrl:]<>;&%@|/\\{}^"*,[\]]+

Please double-check since I backported this from my new
version of the lexer.
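
One way to do that double-check mechanically is to compare the old and
new definitions as sets. The following is only a sketch under stated
assumptions: the character lists are transcribed by hand from the two
sets of definitions in this mail, only ASCII punctuation is compared
(digits, [:blank:] and [:cntrl:] are ignored), and the names OLD, NEW
and diff are hypothetical helpers, not anything from the bogofilter
sources. It also says nothing about the change of the TOKENMID
quantifier from * to +.

```python
import string

PUNCT = set(string.punctuation)

# Punctuation excluded by each class, old definitions (hand transcribed).
OLD = {
    "TOKENFRONT": PUNCT,  # the old definition uses [:punct:] directly
    "TOKENMID": set('<>;=():&%$#@+|/\\{}^"?*,[]'),
    "TOKENBACK": set('<>;=():&%$#@+|/\\{}^"?*,[]._+-'),
}

# Punctuation excluded by each class, proposed definitions.
NEW = {
    "TOKENFRONT": set('<>;&%@|/\\{}^"*,[]?=():#+$._\'`~!-'),
    "TOKENMID": set('<>;&%@|/\\{}^"*,[]?=():#+'),
    "TOKENBACK": set('<>;&%@|/\\{}^"*,[]?=():#+$._\'`~-'),
}


def diff(name):
    """Return (newly excluded, newly allowed) punctuation for one class."""
    old, new = OLD[name] & PUNCT, NEW[name] & PUNCT
    return "".join(sorted(new - old)), "".join(sorted(old - new))


for name in OLD:
    print(name, diff(name))
```

With this transcription, TOKENFRONT comes out unchanged, TOKENBACK
newly excludes ' ` and ~, and TOKENMID no longer excludes $. Whether
the last difference is intended is worth a look.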

This breaks some tests (t.lexer.mbx and t.maint); t.systest
was already broken for me before.

pi