Test with different lexers

Boris 'pi' Piwinger 3.14 at logic.univie.ac.at
Tue Dec 2 17:53:12 CET 2003


Tom Anderson wrote:

> To repeat to the list what I previously said in a private email: The
> difference comes from what you consider a "special" character in a
> token.  To me, every non-space ascii ought to be allowed anywhere in any
> token.  Why would we give special consideration to A-Za-z?  I think you
> assume too much about what a token _ought_ to consist of, rather than
> what it _does_ consist of.  How about "100%"? Or ";-)"?  Or "[sic]"? 
> These are important tokens!  I don't think we should be assuming that
> tokens must be proper english words.

That certainly is an interesting idea, but in my opinion
punctuation is different. Not because treating it specially
happens to work, but because of its function in language,
which makes it special and which should therefore be
reflected in a parser.

In your paragraph above we would get tokens like:
email:
"special"
token.
me,
_does_
etc.

Given the definition of punctuation in any language I know,
that seems very unnatural, as opposed to just:
email
special
token
me
does

On the other hand, we do allow some punctuation inside words,
to cover V.I.A.G.R.A or up-to-date or MSG_COUNT. Maybe you
are right and this is inconsistent.
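The two positions can be sketched roughly like this (a minimal
illustration in Python, not bogofilter's actual lexer; the choice of
which characters to strip is my assumption):

```python
text = 'what you consider a "special" character in a token, like up-to-date'

# Tom's view: split on whitespace only, so every non-space
# character stays part of the token.
whitespace_tokens = text.split()

# My view: additionally strip surrounding punctuation, while
# still allowing internal punctuation (up-to-date, MSG_COUNT).
stripped_tokens = [t.strip('"\'.,;:!?()[]') for t in whitespace_tokens]

print(whitespace_tokens)  # contains '"special"' and 'token,'
print(stripped_tokens)    # contains 'special', 'token' and 'up-to-date'
```

Note that stripping only leading and trailing punctuation keeps
hyphenated and underscored words intact, which is the compromise
described above.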

pi




More information about the Bogofilter mailing list