lexer investigations

Matthias Andree matthias.andree at gmx.de
Tue Feb 25 02:49:29 CET 2003


David Relson <relson at osagesoftware.com> writes:

> After doing that, I'll probably be sending you my latest (unified) lexer
> and bogofilter patches to use it.  My reason for sending it is that as a
> single lexer, it avoids all the buffer swapping problems you've
> discovered and may be usable for testing batch processing.
> Unfortunately, I have not yet integrated html_tokenize.l with my single
> lexer.  Perhaps tomorrow ...

I've attached ltrace -emalloc,realloc to bogofilter while it chews on
a single 3 MB token, which we know takes a long time. I see 370
malloc(20) calls, interspersed with a few exponentially-growing
realloc() calls, and the gap between successive malloc(20) calls grows
as the CPU time accumulates. So it seems it's not memory allocation
that has the foot on the brake: 370 malloc(20) calls can't be
expensive, and my machine shovels several dozen MB around at a time,
so even the realloc() calls won't hurt much. This buffering does waste
a nasty amount of memory, though, and bogofilter's memory use is a
real scalability concern.
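
For reference, the kind of invocation I mean (huge-token.msg is a
stand-in for the actual test message):

  $ ltrace -emalloc,realloc -tt -o trace.log bogofilter < huge-token.msg

With -tt each traced call gets a microsecond wall-clock timestamp,
which is how the widening gaps between the malloc(20) calls show up.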

What I THINK might be expensive is distinguishing between "middle" and
"last" characters in a token. Would we still get reasonable results if
we omitted the trailing [^[:blank:][:punct:][:cntrl:]] part? We could
then match {TOKENFRONT}{TOKENMID}{2,29} for a normal token and
{TOKENFRONT}{TOKENMID}{30} for an overlong one, with a custom function
in the second rule that skips the rest of the input far faster than
flex could match it.
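
Roughly, the rules could look like this (untested sketch; the
character-class definitions, the TOKEN code and the skip_overlong()
helper are mine for illustration, not bogofilter's actual lexer.l):

%{
#include <ctype.h>
#define TOKEN 1                 /* hypothetical token code */
static void skip_overlong(void);
%}

TOKENFRONT  [^[:blank:][:punct:][:cntrl:]]
TOKENMID    [^[:blank:][:cntrl:]]

%%

{TOKENFRONT}{TOKENMID}{2,29}  { return TOKEN;      /* plausible token */ }
{TOKENFRONT}{TOKENMID}{30}    { skip_overlong();   /* junk: discard rest */ }

%%

/* Eat the remainder of an overlong token one character at a time via
 * input(), bypassing the flex matching machinery entirely, then push
 * the delimiter back so the next match starts cleanly. */
static void skip_overlong(void)
{
    int c;

    while ((c = input()) != EOF
           && !isspace((unsigned char)c)
           && !iscntrl((unsigned char)c))
        ;
    if (c != EOF)
        unput(c);
}

Since flex takes the longest match, the {30} rule wins for anything
longer than 30 characters, so the normal rule never sees overlong
junk.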

-- 
Matthias Andree



