Excessive memory usage: bug?

David Relson relson at osagesoftware.com
Mon Mar 14 14:00:01 CET 2005


On Mon, 14 Mar 2005 12:27:11 -0000
Peter Bishop wrote:

> On 10 Mar 2005 at 18:55, David Relson wrote:
> 
> > H'lo Juan,
> > 
> > When registering a mailbox (like you're doing), bogofilter does the
> > following:
> > 
> >   1. create a master wordlist
> >   2. read one message
> >   3. convert it to a list of tokens
> >   4. merge the new tokens with the master list
> >   5. repeat steps 2-4 for all messages
> >   6. update the database with the tokens of the master wordlist
> > 
> > The above technique uses a fair amount of ram but minimizes the disk
> > access for reading and writing the database.
> 
> Is there a problem with this approach?
> If the master wordlist exceeds the available RAM, there might be an 
> "out of memory failure" once the database exceeds a certain size. 
> Even if this does not happen (e.g. because of virtual memory) there 
> certainly could be a lot of virtual memory thrashing if the actual 
> available RAM size were exceeded.

Hi Peter,

Perhaps what is unclear is that "master wordlist" is a RAM structure
that holds the tokens read from the mailbox and the "database" is on
disk.  In the course of reading messages and merging lists, all
duplicates are removed.  Consequently there must be enough RAM to hold
all the newly read tokens.  The tokens in memory are sorted so that the
database can be updated in one pass.  This is significantly faster than
"read/parse a message; update database; repeat."  This approach does
not require holding the full database in memory.
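
As a rough illustration of the idea (a sketch only, not bogofilter's
actual code), the following accumulates token counts in RAM and then
updates the "database" in a single sorted pass.  The names tokenize and
register_mailbox are hypothetical, the tokenizer is a plain whitespace
split, and a dict stands in for the on-disk database.

```python
from collections import Counter


def tokenize(message):
    # Hypothetical stand-in for the real lexer: lowercase whitespace split.
    return message.lower().split()


def register_mailbox(messages, database):
    """Merge all messages into one in-RAM wordlist, then update once."""
    master = Counter()                    # the in-RAM "master wordlist"
    for message in messages:              # read one message at a time
        master.update(tokenize(message))  # merge; duplicates collapse into counts
    # One pass over the sorted tokens; `database` is a dict standing in
    # for the on-disk database (an assumption made for this sketch).
    for token in sorted(master):
        database[token] = database.get(token, 0) + master[token]
    return len(master)                    # number of database records touched


if __name__ == "__main__":
    db = {"cheap": 1}
    touched = register_mailbox(["buy cheap pills", "cheap pills now"], db)
    print(touched, db)
```

Only the distinct tokens of the mailbox are held in memory, and each
database record is touched once, however many messages mention it.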

> An alternative method, which should work regardless of memory size 
> is:
> 
> 1. create an empty *token* list in RAM
> 2. read one message
> 3. convert it to a list of tokens
> 4. merge the new tokens with the token list
> 5. repeat steps 2-4 until some maximum token limit is reached
> 6. read the current token counts from the database
> 7. update the database tokens with the extra token counts
> 8. write the tokens back to the database
> 9. go back to step 1 until the mailbox is empty
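
A minimal sketch of this batched variant, under the same assumptions as
any toy model: the names flush and register_in_batches and the
max_tokens parameter are hypothetical, tokens come from a whitespace
split, and a dict stands in for the on-disk database.  It is not taken
from bogofilter.

```python
from collections import Counter


def flush(batch, database):
    # Read current counts, add the batch counts, write back.
    # `database` is a dict standing in for the on-disk database.
    for token in sorted(batch):
        database[token] = database.get(token, 0) + batch[token]
    batch.clear()                          # start again with an empty token list


def register_in_batches(messages, database, max_tokens):
    """Flush to the database whenever the in-RAM list reaches max_tokens."""
    batch = Counter()
    flushes = 0
    for message in messages:
        batch.update(message.lower().split())  # hypothetical tokenizer
        if len(batch) >= max_tokens:           # limit on distinct tokens held
            flush(batch, database)
            flushes += 1
    if batch:                                  # tokens left after the last message
        flush(batch, database)
        flushes += 1
    return flushes


if __name__ == "__main__":
    db = {}
    print(register_in_batches(["buy cheap pills", "cheap pills now"], db, 3), db)
```

Memory stays bounded by max_tokens plus one message's worth of tokens,
at the cost of touching a frequently seen token's database record once
per batch instead of once per mailbox.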

This could be done, but is it necessary?  My workstation has 512MB RAM
and I _have_ registered a 1008MB mailbox (using just 325MB RAM).

If you're interested in such matters, there's a memdebug.c file that
tracks memory allocation and can display a summary.  Build bogofilter
with "./configure --enable-memdebug".  For 0.94.0 (or older) just run
bogofilter and it will display statistics.  For newer versions
(presently CVS, but 0.94.1 soon) use options "-x y -v" to enable the
memory display.  Be warned that memdebug requires extra memory for
bookkeeping so it can detect unreleased memory.

HTH,

David

_______________________________________________
Bogofilter mailing list
Bogofilter at bogofilter.org
http://www.bogofilter.org/mailman/listinfo/bogofilter