lexer investigations

Tue Feb 25 01:51:29 CET 2003

Nick,

As you know, I've been experimenting with combining the current trio of 
lexers into a single, new lexer that uses lexer states, a.k.a. start 
conditions.  It's been quite interesting.  So far I've found two errors in 
the existing lexer.

First, lexer_head.l has a boundary pattern that allows '#' and sets min and 
max lengths of 1 and 70.  lexer_text_plain.l and lexer_text_html.l don't 
allow the '#' and don't have the length limitations.  I've updated 
lexer_text_*.l, as well as the reference outputs for the regression tests.
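
As a rough illustration of the boundary rule described above, here is a 
small Python sketch.  It is not the flex pattern from lexer_head.l: the 
character class and the names BOUNDARY_RE and is_valid_boundary are 
assumptions made for the example.  Only the '#' allowance and the 1 to 70 
length limits come from the description above.

```python
import re

# Illustrative only: this character class is an assumption, not a copy of
# the pattern in lexer_head.l.  It allows '#' and enforces a length of
# 1 to 70 characters.
BOUNDARY_RE = re.compile(r"[0-9A-Za-z'()+_,./:=?#\- ]{1,70}")


def is_valid_boundary(candidate: str) -> bool:
    """Return True if the whole string matches the illustrative pattern."""
    return BOUNDARY_RE.fullmatch(candidate) is not None
```

With this sketch, a boundary containing '#' is accepted, while an empty 
string or one longer than 70 characters is rejected.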

Second, in comparing outputs of the fixed lexer (above) and my experimental 
lexer, I've found anomalies in the processing of "^From " message 
headers.  Attached file bradyn.mbx.gz has an excerpt from 
tests/bogofilter/inputs/good.mbx.  Unpack it and run the following commands:

	rm -f *.db
	bogofilter -C -d . -n -I bradyn.mbx
	bogoutil -w goodlist.db bradyn

bogoutil prints a count of 2, but the correct count is 1.  If you run 
bogoutil with the "-x l -vv" flags, it appears that flex is "saving" that 
token and returning it in a later message.

I'm changing lexer_*.l so that tokens from message separators are _not_ 
returned to bogofilter.  With postfix (which I use), the message separator 
is created by postfix and contains info that is also in the "Return-Path:" 
header (as well as a timestamp).  Discarding this info from the message 
separator simply means that bogofilter will see slightly fewer duplicate 
tokens.  Stated differently, this change is harmless and doesn't affect 
results.
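
To make the intent concrete, here is a minimal Python sketch of the idea. 
It is not the flex code: the function name tokens_without_separators is 
hypothetical, splitting on whitespace stands in for the real token rules, 
and treating every line that starts with "From " as a separator is a 
simplification.

```python
def tokens_without_separators(lines):
    """Yield whitespace-separated tokens, skipping "From " separator lines.

    Simplification: any line starting with "From " is treated as a message
    separator; the real lexer's token rules are more involved.
    """
    for line in lines:
        if line.startswith("From "):
            continue  # discard the separator line and all of its tokens
        yield from line.split()
```

For example, given a separator line followed by a "Return-Path:" header 
that carries the same address, only the header's tokens are produced, so 
the address is seen once rather than twice.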

It seems that I owe you an apology.  I pooh-poohed your statements that 
"^From " processing is buggy.  I now know you were correct, though I don't 
know if the defect described above is what you've found.

I haven't yet committed this latest change to CVS.  That's the next task on 
my agenda.

After doing that, I'll probably send you my latest (unified) lexer and the 
bogofilter patches needed to use it.  I'm sending it because, as a single 
lexer, it avoids all the buffer swapping problems you've discovered and may 
be usable for testing batch processing.  Unfortunately, I have not yet 
integrated html_tokenize.l with my single lexer.  Perhaps tomorrow ...

Regards,

David
-------------- next part --------------
A non-text attachment was scrubbed...
Name: bradyn.mbx.gz
Type: application/octet-stream
Size: 782 bytes
Desc: not available
URL: <https://www.bogofilter.org/pipermail/bogofilter-dev/attachments/20030224/2e23b784/attachment.obj>
--------------------------------------------------------
David Relson                   Osage Software Systems, Inc.
relson at osagesoftware.com       Ann Arbor, MI 48103
www.osagesoftware.com          tel:  734.821.8800