lexer changes

Wed Nov 12 13:06:24 CET 2003

On Wed, 12 Nov 2003 10:53:51 +0100
Boris 'pi' Piwinger <3.14 at logic.univie.ac.at> wrote:

> David Relson wrote:
> 
> >> >> I am not sure about  MSG_COUNT	^\".MSG_COUNT\" -- are those
> >> >> \ needed?
> >> > 
> >> > Does it matter?  It works.  As they say, "if it ain't broke,
> >> > don't fix it."
> >> 
> >> It is confusing. And as you said yesterday, there is "code
> >> that _looks_ ok (on casual inspection) but is actually
> >> incorrect", so I try to understand if it is correct, not if
> >> it seems to work.
> > 
> > Try it.  Take a pristine lexer_v3.l, make sure it passes "make
> > check", then remove the quotes and see what happens.
> 
> Interesting that even you don't know if they are required;-)
> According to make check they are required. So I now
> understand some expressions more which do or don't have \.
> This does make a difference.
> 
> pi

pi,

It shouldn't be news to you that I'm not a flex expert.  I've used the
list to request help on several expressions.  My knowledge of flex is
based on use, experimentation, and reading the documentation.  I don't
know it all and I'm still learning.

"make check" is (often) the quickest way to see if a lexer modification
causes a parsing difference.  When there is a difference I can analyse
what's changed and decide if it should be different.  If so, I change
the reference results (the stored output files).
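
As a rough illustration of that workflow (this is not bogofilter's actual
"make check" harness; the function name and the sample token lists are
hypothetical), a reference comparison can be sketched in Python as:

```python
import difflib


def diff_against_reference(reference_lines, actual_lines):
    """Return a unified diff between stored and fresh lexer output.

    An empty list means the lexer change caused no parsing difference.
    """
    return list(
        difflib.unified_diff(
            reference_lines,
            actual_lines,
            fromfile="reference",
            tofile="actual",
            lineterm="",
        )
    )


if __name__ == "__main__":
    # Hypothetical token lists standing in for the stored output files.
    reference = ["hello", "world", "u.s.a"]
    actual = ["hello", "world", "u.s.a."]
    for line in diff_against_reference(reference, actual):
        print(line)
```

If the diff is empty, nothing changed.  If it isn't, the diff shows which
tokens differ, and a person still has to decide whether the new output or
the stored reference is the correct one.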

AFAIK the goal of the parser was to break a message into words, with a
word being a letter followed by letters or digits and with certain
special characters allowed (like periods for abbreviations, apostrophes
for contractions, etc).  Unfortunately, with flex it's sometimes
necessary to list what's _not_ allowed, which leads to constructs like
TOKENMID's "[^[:blank:]<>;=():&%$#@+|/\\{}^\"?*,[:cntrl:]\[\]]".  I
don't like listing all those special characters but that's what flex
needs to do its job.  So, I live with it.
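
To make the "list what's not allowed" idea concrete, here is a small
Python sketch.  It is a hand translation, not the lexer itself: Python's
re module has no [:blank:] or [:cntrl:] classes, so they are spelled out
as space/tab and \x00-\x1f\x7f, and the WORD rule (one ASCII letter followed
by any number of TOKENMID characters) is a simplifying assumption, not the
rule actually used in lexer_v3.l.

```python
import re

# Hand translation of the flex negated class quoted above.
# [:blank:] -> space and tab; [:cntrl:] -> \x00-\x1f and \x7f.
TOKENMID = r'[^ \t<>;=():&%$#@+|/\\{}^"?*,\x00-\x1f\x7f\[\]]'

# Assumed simplification: a word is a letter followed by TOKENMID characters.
WORD = re.compile(r"[A-Za-z]" + TOKENMID + r"*")


def tokenize(text):
    """Return the words found in text under the simplified rule above."""
    return WORD.findall(text)


if __name__ == "__main__":
    print(tokenize("don't pay $5 to U.S.A. <spam>"))
```

Apostrophes and periods survive inside a word because they are simply
absent from the excluded list, while characters such as "<", ">" and "$"
end or block a token because they are listed.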

As bogofilter has evolved, we've learned it's useful to include money
values, recognize IP addresses, ignore message IDs, remove html
comments, include tokens within a|img|font tags, etc.  The lexer has
grown and become more complicated.  It would be nice if it could be
really, really simple, but it's not :-(

David