' or ` at TOKENBACK (was: lexer change)

Boris 'pi' Piwinger 3.14 at logic.univie.ac.at
Mon Nov 17 12:09:16 CET 2003


Boris 'pi' Piwinger wrote:

> Here is once more pretty much what a token is:
>> TOKENFRONT	[^[:blank:][:cntrl:][:digit:][:punct:]]
>> TOKENMID	[^[:blank:]<>;=():&%$#@+|/\\{}^\"?*,[:cntrl:][\]]*
>> TOKENBACK	[^[:blank:]<>;=():&%$#@+|/\\{}^\"?*,[:cntrl:][\]._+-]
> 
> I'll even have to look up which of these listed characters
> are not in [:punct:], which is AFAICS this list: ! " # $ % & '
> ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
> (http://www.gnu.org/software/grep/doc/grep_8.html)

> BTW: There is a + doubled in TOKENBACK.

AFAICS it still is.

> At the end of a word we only allow !'` in addition to those
> allowed at the front. I cannot say why ' or ` should be
> there.

Nobody took that up. With the latest CVS lexer I get this:
$ echo test\' |bogolexer
normal mode.
get_token: 1 "head:test'"
1 tokens read.
$ echo test\` |bogolexer
normal mode.
get_token: 1 "head:test`"
1 tokens read.

This is what you would expect from the definition of
TOKENBACK. I believe this has just been overlooked. The
allowed ! is covered by a comment explaining that we want
it; ' and ` are not mentioned there, so they were probably
just lost when [:punct:] was expanded in TOKENBACK.
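
To make the expansion concrete, here is a minimal Python sketch of the
three character classes as quoted above. It is not bogolexer: it assumes
a plain ASCII/C locale, it assumes a token is TOKENFRONT followed by
TOKENMID followed by TOKENBACK (the composition rule is not quoted in
this thread), and the name is_token is made up for illustration.

```python
import string

# POSIX classes, assuming an ASCII/C locale.
BLANK = set(" \t")
CNTRL = {chr(c) for c in range(32)} | {chr(127)}
DIGIT = set(string.digits)
PUNCT = set(string.punctuation)

# Characters listed literally in the quoted TOKENMID definition.
MID_LISTED = set('<>;=():&%$#@+|/\\{}^"?*,[]')

FRONT_EXCLUDED = BLANK | CNTRL | DIGIT | PUNCT
MID_EXCLUDED = BLANK | CNTRL | MID_LISTED
# The quoted TOKENBACK adds . _ + - (the + is the doubled one).
BACK_EXCLUDED = MID_EXCLUDED | set("._+-")


def is_token(s):
    """Hypothetical check: assumes TOKEN = FRONT MID BACK."""
    if len(s) < 2:
        return False
    return (s[0] not in FRONT_EXCLUDED
            and all(c not in MID_EXCLUDED for c in s[1:-1])
            and s[-1] not in BACK_EXCLUDED)


for word in ["test'", "test`", "test!", "test.", "'test"]:
    print(repr(word), is_token(word))
```

Under these assumptions test' and test` are accepted as whole tokens,
which is consistent with the bogolexer output above, while test. and
'test are not.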

I cannot see any reason why we should accept those two
characters at the end of a token. So I'd remove them.

For better readability, I'd also move [:cntrl:] to the
beginning, reorder the rest, and expand [:punct:] so the
definitions can be compared, like this:
> TOKENFRONT	[^[:blank:][:cntrl:]<>;&%@|/\\{}^"*,[\]?=():#+$._'`~![:digit:]-]
> TOKENBACK	[^[:blank:][:cntrl:]<>;&%@|/\\{}^"*,[\]?=():#+$._'`~-]
> TOKENMID	[^[:blank:][:cntrl:]<>;&%@|/\\{}^"*,[\]?=():#+]+
> BOGOLEX_TOKEN	[^[:blank:][:cntrl:]<>;&%@|/\\{}^"*,[\]]+

Please double-check since I backported this from my new
version of the lexer.
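
One rough way to double-check is to compare the punctuation characters
each definition excludes. The sketch below is only a transcription of
the old and proposed patterns from this message into Python sets (the
variable names and the show helper are made up, and a transcription
slip on my side is possible). It ignores [:blank:], [:cntrl:] and
[:digit:], which appear unchanged, and it ignores the change of the
TOKENMID quantifier from * to +.

```python
import string

PUNCT = set(string.punctuation)

# Old definitions, as quoted at the top of this message.
OLD_MID = set('<>;=():&%$#@+|/\\{}^"?*,[]')
OLD_BACK = OLD_MID | set("._+-")

# Proposed definitions, transcribed character by character.
NEW_FRONT = set('<>;&%@|/\\{}^"*,[]?=():#+$._\'`~!-')
NEW_BACK = set('<>;&%@|/\\{}^"*,[]?=():#+$._\'`~-')
NEW_MID = set('<>;&%@|/\\{}^"*,[]?=():#+')


def show(label, chars):
    # Print a set of characters in a stable order.
    print(label, " ".join(sorted(chars)) or "(none)")


show("front: punct not excluded by new:", PUNCT - NEW_FRONT)
show("back:  newly excluded:", NEW_BACK - OLD_BACK)
show("back:  no longer excluded:", OLD_BACK - NEW_BACK)
show("back:  punct still allowed:", PUNCT - NEW_BACK)
show("mid:   newly excluded:", NEW_MID - OLD_MID)
show("mid:   no longer excluded:", OLD_MID - NEW_MID)
```

With this transcription, the expanded TOKENFRONT covers all of
[:punct:], and TOKENBACK leaves only ! allowed. Two things stand out
and may deserve a second look: TOKENBACK now excludes ~ in addition to
' and `, and the proposed TOKENMID no longer excludes $.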

This breaks some tests (t.lexer.mbx and t.maint); t.systest
was already broken for me before.

pi




