' or ` at TOKENBACK (was: lexer change)

David Relson relson at osagesoftware.com
Mon Nov 17 13:13:34 CET 2003


On Mon, 17 Nov 2003 12:09:16 +0100
Boris 'pi' Piwinger <3.14 at logic.univie.ac.at> wrote:

> Boris 'pi' Piwinger wrote:
> 
> > Here is once more pretty much what a token is:
> >> TOKENFRONT	[^[:blank:][:cntrl:][:digit:][:punct:]]
> >> TOKENMID	[^[:blank:]<>;=():&%$#@+|/\\{}^\"?*,[:cntrl:][\]]*
> >> TOKENBACK	[^[:blank:]<>;=():&%$#@+|/\\{}^\"?*,[:cntrl:][\]._+-]
> > 
> > I'll even have to look up which of these characters listed
> > is not in [:punct:] which is AFAICS this list: ! " # $ % & '
> > ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
> > (http://www.gnu.org/software/grep/doc/grep_8.html)
> 
> > BTW: There is a + doubled in TOKENBACK.
> 
> AFAICS it still is.

Flex treats the character list as a set, so reordering it or adding or
removing duplicates makes no difference to the resulting scanner.
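
A rough way to convince yourself of that outside flex: the sketch below
uses Python's re module as a stand-in, with two made-up classes that only
mimic the tail of TOKENBACK (the POSIX classes are left out because Python
does not support them). It is an illustration of set semantics, not a test
of flex itself.

```python
import re
import string

# Hypothetical, simplified stand-ins for the tail of TOKENBACK.
with_duplicate = re.compile(r"[^+._+\-]")   # '+' listed twice
reordered = re.compile(r"[^\-_.+]")         # same characters, other order

def accepted(pattern):
    """Return the printable ASCII characters the class matches."""
    return {c for c in string.printable if pattern.fullmatch(c)}

# Both classes accept exactly the same characters.
print(accepted(with_duplicate) == accepted(reordered))
```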

> > At the end of a word we only allow !'` in addition to those
> > allowed at the front. I cannot say why ' or ` should be
> > there.
> 
> Nobody took that up. With the latest CVS lexer I get this:
> $ echo test\' |bogolexer
> normal mode.
> get_token: 1 "head:test'"
> 1 tokens read.
> $ echo test\` |bogolexer
> normal mode.
> get_token: 1 "head:test`"
> 1 tokens read.

As you say, they were probably overlooked.  
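
The same result can be read off the definition without running the lexer.
A minimal sketch, assuming the TOKENBACK line quoted above is transcribed
correctly and that [:punct:] is the 32-character ASCII list (which is what
Python's string.punctuation contains); the variable names are mine:

```python
import string

# Punctuation listed literally in the quoted TOKENBACK bracket expression.
# The doubled '+' is kept on purpose; a set ignores it.
tokenback_excluded = set('<>;=():&%$#@+|/\\{}^"?*,[]._+-')

# Punctuation that TOKENBACK does not exclude, i.e. what may end a token.
allowed_at_end = sorted(set(string.punctuation) - tokenback_excluded)
print(allowed_at_end)
```

On this reading the survivors are ! ' ` and also ~, so the tilde appears
to be in the same situation as the two quote characters.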

BTW, the output can be made cleaner.  Try "echo test\' test\` |
bogolexer -p -PH"

> This is what you would expect from the definition of
> TOKENBACK. I believe this has just been overlooked. The
> allowed ! is in a comment explaining we want it, those are
> not mentioned, so probably they just got lost from expanding
> [:punct:] in TOKENBACK.
> 
> I cannot see any reason why we should accept those two
> characters at the end of a token. So I'd remove them.
> 
> For better readability, I'd also move the [:cntrl:] to the
> beginning and do some reordering, also expand [:punct:] for
> comparability, like this:
> > TOKENFRONT	[^[:blank:][:cntrl:]<>;&%@|/\\{}^"*,[\]?=():#+$._'`~![:digit:]-]
> > TOKENBACK	[^[:blank:][:cntrl:]<>;&%@|/\\{}^"*,[\]?=():#+$._'`~-]
> > TOKENMID	[^[:blank:][:cntrl:]<>;&%@|/\\{}^"*,[\]?=():#+]+
> > BOGOLEX_TOKEN	[^[:blank:][:cntrl:]<>;&%@|/\\{}^"*,[\]]+
> 
> Please double-check since I backported this from my new
> version of the lexer.
> 
> This breaks some tests (t.lexer.mbx and t.maint); t.systest
> was already broken for me before.

I've made the suggested changes to my copy of lexer_v3.l.  This evening
I'll have time to run "make check" and look at what's changed.  If the
results seem reasonable, I'll commit the lexer change and update the
test results.
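
As a quick sanity check ahead of "make check", the two TOKENBACK exclusion
lists can be compared directly. This is only a sketch: it copies the
literal punctuation from the old and proposed definitions quoted above,
ignores [:blank:] and [:cntrl:] (unchanged in both), and uses names of my
own choosing.

```python
import string

# Literal punctuation excluded by the old and the proposed TOKENBACK.
old_back = set('<>;=():&%$#@+|/\\{}^"?*,[]._+-')
new_back = set('<>;&%@|/\\{}^"*,[]?=():#+$._\'`~-')

# Characters that could end a token before but no longer can.
newly_rejected = sorted(new_back - old_back)
# Characters that become allowed again (expected to be none).
newly_allowed = sorted(old_back - new_back)
# Punctuation still permitted at the end of a token.
still_allowed = sorted(set(string.punctuation) - new_back)

print(newly_rejected, newly_allowed, still_allowed)
```

If the lists are copied correctly, only ' ` and ~ change status, and ! is
the one punctuation character still accepted at the end of a token.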

David



