ALPHA [was: lexer change]

David Relson relson at osagesoftware.com
Tue Nov 11 00:30:18 CET 2003


On Tue, 11 Nov 2003 00:03:55 +0100
Boris 'pi' Piwinger <3.14 at logic.univie.ac.at> wrote:

> David Relson <relson at osagesoftware.com> wrote:
...[snip]...
> >They are different.  Read the flex documentation or create a test lexer
> >and test it.
> 
> I tried to read but failed to understand.

After writing the message I remembered that the flex documentation is
not "crystal clear".  Reading and understanding it is not simple, even
for a native English speaker.  Reading it in a foreign language is an
accomplishment, and my hat is off to you for making the effort (or
understanding it).


> OK, that makes a difference.
> 
> >A1 is needed for the places where a single letter needs to be
> >identified for use in a token and A2 is needed for a single letter
> >followed by a letter or a digit.  An example is a token split by an html
> >comment, e.g. "T<!xxx>ha<!xx>t".
> 
> I don't understand why just single letters or letters
> followed by a letter or digit and not any sequence.

Because that's how the lexer works....

In order to process "T<!xxx>ha<!xx>t" and get "That", the code that
removes the comment parts needs to have the stuff before the comment and
the stuff after it.  Without the A1 and A2 patterns, too little text
would be passed in and the result would be wrong.
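
To make that concrete, here is a rough Python sketch (not the actual
bogofilter code, which is a flex lexer).  It assumes a comment is anything
of the form <!...>, as in the example, and the function names are made up
for illustration.  It contrasts handing the whole span to the
comment-stripping step with handing it one piece at a time:

```python
import re

# Assumption: a "comment" is anything of the form <!...>, as in the example.
COMMENT = re.compile(r"<![^>]*>")

def strip_comments(text):
    """Remove <!...> comments so the text on either side joins up."""
    return COMMENT.sub("", text)

def pieces_between_comments(text):
    """What you get if each piece between comments is treated separately."""
    return [piece for piece in COMMENT.split(text) if piece]

sample = "T<!xxx>ha<!xx>t"
whole = strip_comments(sample)            # whole span passed in at once
pieces = pieces_between_comments(sample)  # too little text passed in each time

print(whole)
print(pieces)
```

With the whole span available the pieces rejoin into one word; handed over
piece by piece, they stay as separate fragments.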

The best way to answer questions like this is to create a version of
bogolexer with a modified parser and then compare its output to the
output of the official lexer.
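
For the comparison step, something along these lines would do.  It is
only a sketch: it assumes you have already captured each lexer's output as
a list of tokens, one per entry, and the token lists below are made-up
sample data, not real bogolexer output.

```python
import difflib

def token_diff(official, modified):
    """Return unified-diff lines between two token lists."""
    return list(difflib.unified_diff(official, modified,
                                     fromfile="official",
                                     tofile="modified",
                                     lineterm=""))

# Made-up sample data standing in for captured lexer output.
official_tokens = ["That", "is", "spam"]
modified_tokens = ["T", "ha", "t", "is", "spam"]

for line in token_diff(official_tokens, modified_tokens):
    print(line)
```

An empty result means the two lexers produced the same tokens for that
input; any "-" or "+" lines show exactly where they differ.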




