lexer changes
David Relson
relson at osagesoftware.com
Wed Nov 12 13:06:24 CET 2003
On Wed, 12 Nov 2003 10:53:51 +0100
Boris 'pi' Piwinger <3.14 at logic.univie.ac.at> wrote:
> David Relson wrote:
>
> >> >> I am not sure about MSG_COUNT ^\".MSG_COUNT\" -- are those
> >> >> \ needed?
> >> >
> >> > Does it matter? It works. As they say, "if it ain't broke,
> >> > don't fix it."
> >>
> >> It is confusing. And as you said yesterday, there is "code
> >> that _looks_ ok (on casual inspection) but is actually
> >> incorrect", so I try to understand if it is correct, not if
> >> it seems to work.
> >
> > Try it. Take a pristine lexer_v3.l, make sure it passes "make
> > check", then remove the quotes and see what happens.
>
> Interesting that even you don't know if they are required;-)
> According to make check they are required. So I now
> understand some expressions more which do or don't have \.
> This does make a difference.
>
> pi
pi,
It shouldn't be news to you that I'm not a flex expert. I've used the
list to request help on several equations. My knowledge of flex is
based on use, experimentation, and reading the documentation. I don't
know it all and I'm still learning.
"make check" is (often) the quickest way to see if a lexer modification
causes a parsing difference. When there is a difference I can analyse
what's changed and decide if it should be different. If so, I change
the reference results (the stored output files).
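The compare-against-reference idea can be sketched in a few lines of Python. This is only an illustration of the workflow, not bogofilter's actual "make check" harness; the function name and the token lists are hypothetical.

```python
import difflib

def check_against_reference(actual_tokens, reference_tokens):
    """Return unified-diff lines; an empty list means no parsing difference."""
    return list(difflib.unified_diff(
        reference_tokens, actual_tokens,
        fromfile="reference", tofile="actual", lineterm=""))

# Hypothetical stored results and the output after a lexer modification.
reference = ["don't", "pay", "U.S.A"]
actual = ["don", "t", "pay", "U.S.A"]

for line in check_against_reference(actual, reference):
    print(line)
```

An empty diff means the change was harmless. A non-empty one is the cue to decide whether the new output is right and the reference should be updated.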
AFAIK the goal of the parser was to break a message into words, with a
word being a letter followed by letters or digits and with certain
special characters allowed (like periods for abbreviations, apostrophes
for contractions, etc). Unfortunately with flex it's sometimes
necessary to list what's _not_ allowed which leads to constructs like
TOKENMID's "[^[:blank:]<>;=():&%$#@+|/\\{}^\"?*,[:cntrl:]\[\]]". I
don't like listing all those special characters but that's what flex
needs to do its job. So, I live with it.
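To show what a "list what's not allowed" class does, here is a rough Python approximation of the TOKENMID class. It is not the flex rule itself: Python's re module has no POSIX classes, so [:blank:] and [:cntrl:] are expanded by hand, and the "starts with a letter" rule and the tokenize name are assumptions made for the sketch.

```python
import re

# Characters excluded by the TOKENMID class quoted above. [:blank:] is
# spelled out as space and tab; [:cntrl:] is approximated by the ASCII
# control range.
EXCLUDED = " \t" + "<>;=():&%$#@+|/\\{}^\"?*," + "[]"
CNTRL = r"\x00-\x1f\x7f"
TOKENMID = "[^" + re.escape(EXCLUDED) + CNTRL + "]"

# Assumption: a word starts with an ASCII letter and continues with any
# number of TOKENMID characters. The real lexer has more rules than this.
WORD = re.compile("[A-Za-z]" + TOKENMID + "*")

def tokenize(text):
    """Return the words found in text, in order of appearance."""
    return WORD.findall(text)

print(tokenize("don't pay $5, see <b>U.S.A</b>"))
# ["don't", 'pay', 'see', 'b', 'U.S.A', 'b']
```

The apostrophe and the period survive because they are not in the excluded list, while "$", "<", ">" and "/" end a word.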
As bogofilter has evolved, we've learned it's useful to include money
values, recognize IP addresses, ignore message IDs, remove HTML
comments, include tokens within a|img|font tags, etc. The lexer has
grown and become more complicated. It would be nice if it could be
really, really simple, but it's not :-(
David