lexer investigations
David Relson
relson at osagesoftware.com
Tue Feb 25 01:51:29 CET 2003
Nick,
As you know, I've been experimenting with combining the current trio of
lexers into a single new lexer that uses lexer states, a.k.a. start
conditions. It's been quite interesting. So far I've found two errors in
the existing lexers.
First, lexer_head.l has a boundary pattern that allows '#' and sets min and
max lengths of 1 and 70. lexer_text_plain.l and lexer_text_html.l don't
allow the '#' and don't have the length limitations. I've updated
lexer_text_*.l to match lexer_head.l, and have also updated the reference
outputs for the regression tests.
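To make the boundary rule concrete, here is a rough sketch in Python (not flex) of the check described above: '#' is allowed and the length must be between 1 and 70. The character class is an assumption, taken from the RFC 2046 "bchars" set plus '#', with the space character left out for simplicity. The actual class in lexer_head.l may differ, and is_valid_boundary is a made-up name.

```python
import re

# Assumption: RFC 2046 "bchars" plus '#', with space omitted for simplicity.
# The real character class used in lexer_head.l may differ.
BOUNDARY_CHARS = r"0-9A-Za-z'()+_,./:=?#-"

# Length limits of 1 and 70, as stated for lexer_head.l.
BOUNDARY_RE = re.compile(r"[" + BOUNDARY_CHARS + r"]{1,70}")


def is_valid_boundary(value: str) -> bool:
    """Return True if the whole string is an acceptable boundary value."""
    return BOUNDARY_RE.fullmatch(value) is not None
```

With all three lexers sharing one rule like this, a boundary containing '#' or one longer than 70 characters is treated the same way in the header and in the text bodies.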
Second, in comparing outputs of the fixed lexer (above) and my experimental
lexer, I've found anomalies in the processing of "^From " message
headers. Attached file bradyn.mbx.gz has an excerpt from
tests/bogofilter/inputs/good.mbx. Unpack it and run the following commands:

    rm -f *.db
    bogofilter -C -d . -n -I bradyn.mbx
    bogoutil -w goodlist.db bradyn

bogoutil prints a count of 2, but the correct count is 1. If you
run bogoutil with the "-x l -vv" flags, it appears that flex is "saving" the
"bradyn" token and returning it in a later message.
I'm changing lexer_*.l so that tokens from message separators are _not_
returned to bogofilter. Using postfix (as I do), the message separator is
created by postfix and contains info that is in the "Return-Path:" (as well
as a timestamp). Discarding this info from the message separator simply
means that bogofilter will see a few fewer duplicate tokens. Stated
differently, this change is harmless and doesn't affect results.
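As a rough illustration of the intended behavior (in Python rather than flex, and not the actual lexer_*.l change), the sketch below treats every line starting with "From " as a message separator, starts a new message there, and returns no tokens from that line. TOKEN_RE and tokenize_mbox are made-up names, the token rule is a placeholder, and real mbox parsing is stricter about what counts as a separator.

```python
import re
from typing import Iterable, Iterator, List

# Hypothetical token rule, only for illustration; bogofilter's real rules differ.
TOKEN_RE = re.compile(r"[A-Za-z0-9$][A-Za-z0-9$'.-]*")


def tokenize_mbox(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield one token list per message, discarding '^From ' separator lines."""
    tokens = None
    for line in lines:
        if line.startswith("From "):
            # Message separator: finish the previous message, start a new
            # one, and return none of the separator's own tokens.
            if tokens is not None:
                yield tokens
            tokens = []
            continue
        if tokens is None:
            tokens = []  # input that does not begin with a separator
        tokens.extend(t.lower() for t in TOKEN_RE.findall(line))
    if tokens is not None:
        yield tokens
```

Because the separator line is dropped before tokenizing, an address that appears both there and in "Return-Path:" is counted once for that message, and nothing from the separator can leak into a neighboring message.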
It seems that I owe you an apology. I pooh-poohed your statements that
"^From " processing is buggy. I now know you were correct, though I don't
know if the defect described above is what you've found.
I haven't yet committed this latest change to CVS. That's the next task on my
agenda.
After doing that, I'll probably be sending you my latest (unified) lexer
and bogofilter patches to use it. My reason for sending it is that as a
single lexer, it avoids all the buffer swapping problems you've discovered
and may be usable for testing batch processing. Unfortunately, I have not
yet integrated html_tokenize.l with my single lexer. Perhaps tomorrow ...
Regards,
David
-------------- next part --------------
A non-text attachment was scrubbed...
Name: bradyn.mbx.gz
Type: application/octet-stream
Size: 782 bytes
Desc: not available
URL: <https://www.bogofilter.org/pipermail/bogofilter-dev/attachments/20030224/2e23b784/attachment.obj>
-------------- next part --------------
--------------------------------------------------------
David Relson                    Osage Software Systems, Inc.
relson at osagesoftware.com     Ann Arbor, MI 48103
www.osagesoftware.com           tel: 734.821.8800