What has become of buff and word and fgetsl?

David Relson relson at osagesoftware.com
Thu Feb 27 05:45:29 CET 2003


At 11:28 PM 2/26/03, Matthias Andree wrote:

>David Relson <relson at osagesoftware.com> writes:
>
> > Right.  It's especially noticeable with Greg's n*100k 'x' files, which
> > bogofilter used to only partially process.  Stepping through the code, I
> > could see flex ask for 8192 bytes, get 76, ask for 8192-76, get 76 more,
> > ask for 8192-76-76, etc.
>
>Why is that? I mean, where is flex heading to with 8k? Is that "batch
>aka. 'swallow the feast in one go'" mode at work?

Flex seems to use 8k for its basic buffer size.  When reading Greg's file,
flex first gets a quoted-printable (qp) line (76 x's), then tries to match a
rule.  The rule indicates that a longer token can be matched, so flex offers
the rest of the buffer (8192-76) for the second request.  Another 76 char
line is read.  Match token.  Need more data.  Loop until the buffer is full.

That's as far as I traced it when I encountered the fgetsl() problem a
couple of days ago.  I expect that what flex does is expand the buffer,
read until full, expand, ...   I presume this continues until enough of the
file is read in to match the pattern.  For 3.txt that amount is 100k and
for 4.txt it's 600k.
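
The request pattern in that trace can be reproduced with a small model.  This
is only a sketch under assumptions: struct line_source, read_one_line() and
fill_buffer() are hypothetical names made up for illustration, not flex or
bogofilter code, and the model simply stops once a whole line no longer fits,
where real code would have to deal with a partial line.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical input source: a run of fixed-length lines. */
struct line_source {
    size_t line_len;    /* bytes per line, including the newline */
    size_t lines_left;  /* lines not yet delivered */
};

/* Line-oriented read: delivers at most one line per call, as an
 * fgets()-style reader would, even if the caller has room for more.
 * Returns 0 when no lines remain or a whole line does not fit. */
static size_t read_one_line(struct line_source *src, char *dst, size_t max_size)
{
    if (src->lines_left == 0 || max_size < src->line_len)
        return 0;
    memset(dst, 'x', src->line_len - 1);
    dst[src->line_len - 1] = '\n';
    src->lines_left--;
    return src->line_len;
}

/* Fill buf the way the trace describes: each request starts where the
 * previous one ended and asks only for the space that remains.
 * Returns the number of read calls made; *filled gets the bytes stored. */
static size_t fill_buffer(struct line_source *src, char *buf, size_t cap,
                          size_t *filled)
{
    size_t used = 0, calls = 0;

    assert(buf != NULL && filled != NULL);
    for (;;) {
        size_t got = read_one_line(src, buf + used, cap - used);
        calls++;
        if (got == 0)
            break;
        used += got;
    }
    *filled = used;
    return calls;
}
```

With 76-byte lines and an 8192-byte buffer, the model makes 107 successful
reads followed by one that fails because only 60 bytes remain.  That is one
call per line, which is the overhead being discussed.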

> > Changing address and length is a classic method for adding more data
> > to an existing buffer.  Bogofilter's code for killing html comments
> > uses the same technique to process multiline comments.
>
>I still wonder if this "buffer offsetting" technique is the right thing
>to do. It harms efficiency by bringing OS syscall overhead upon
>ourselves. I'd agree with that code in the HTML comment killer (after
>all, things are getting shorter), but the scheme for READING from a file
>is heading for the brick wall.

Yes indeed.  There's a brick wall up ahead.

>I mean, what happens when the buffer is ultimately too small and the
>read request cannot be satisfied (e. g. you have four bytes left, but
>you must fit "weather\n")? When is the buffer drained and the ->read
>pointer reset? All this is a mystery to me currently. Is the ->read
>stuff necessary?
>
>I know it's error prone though: when you merged the buff_fgetsl.c
>function, you offset the start pointer, but you didn't reduce the
>maximum buffer size. Nice heap smasher. Is this ->read REALLY needed?

Hey, I'm still debugging the changes.
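
For reference, the class of bug described above (start pointer moved forward,
limit left unchanged) and its fix can be sketched as follows.  The struct and
function names here are hypothetical stand-ins chosen for illustration; they
are not bogofilter's actual definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical buffer with a fill offset. */
struct buf {
    char  *text;  /* start of storage */
    size_t size;  /* total capacity in bytes */
    size_t used;  /* bytes already filled */
};

/* Space that may still be written after the bytes already stored. */
static size_t buf_room(const struct buf *b)
{
    assert(b->used <= b->size);
    return b->size - b->used;
}

/* Append up to len bytes from src, never writing past text + size.
 * The buggy pattern would use b->size as the limit while still
 * writing at b->text + b->used, overrunning by up to b->used bytes.
 * Returns the number of bytes actually copied. */
static size_t buf_append(struct buf *b, const char *src, size_t len)
{
    size_t room = buf_room(b);
    size_t n = len < room ? len : room;

    memcpy(b->text + b->used, src, n);
    b->used += n;
    return n;
}
```

The point is that the offset and the limit have to move together: whatever is
added to the write position must be subtracted from the size passed on.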

>We'll need two people who've never seen bogofilter to audit the code
>before we release "stable 1.0" and]

Eh??? Audit it to look for what???


>I've written a glibc-compatible getline() emulation function with less
>strict license for leafnode that has a static buffer which is extended
>when needed -- we could use that to read into buff -- however, anything
>that relies on pointers _into_ buff, IOW, anything that relies on
>buff.text being invariant, will crash.
>
>--
>Matthias Andree
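
The pointer-invalidation concern raised above can be illustrated with a
growable buffer.  This is a minimal sketch, assuming a realloc()-based buffer;
struct growbuf, growbuf_append() and growbuf_at() are hypothetical names and
this is not the leafnode getline() emulation mentioned in the message.  The
idea shown is to remember offsets rather than pointers, since realloc() may
move the storage.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable buffer. */
struct growbuf {
    char  *text;  /* storage; may move whenever the buffer grows */
    size_t size;  /* bytes allocated */
    size_t used;  /* bytes filled */
};

/* Append len bytes, doubling the allocation until they fit.
 * Returns 0 on success, -1 if realloc() fails (buffer unchanged).
 * Overflow of the doubling is not handled in this sketch. */
static int growbuf_append(struct growbuf *g, const char *src, size_t len)
{
    if (len == 0)
        return 0;
    if (len > g->size - g->used) {
        size_t want = g->size ? g->size : 16;
        char *p;

        while (want - g->used < len)
            want *= 2;
        p = realloc(g->text, want);
        if (p == NULL)
            return -1;
        g->text = p;   /* old pointers into the buffer are now invalid */
        g->size = want;
    }
    memcpy(g->text + g->used, src, len);
    g->used += len;
    return 0;
}

/* Resolve a saved offset to a pointer only at the point of use. */
static const char *growbuf_at(const struct growbuf *g, size_t off)
{
    assert(off <= g->used);
    return g->text + off;
}
```

Code that keeps a char * into the buffer across an append may crash after a
reallocation; code that keeps a size_t offset and calls growbuf_at() when it
needs the data does not depend on the buffer address staying fixed.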