Excessive memory usage: bug?

David Relson relson at osagesoftware.com
Mon Mar 14 23:55:42 CET 2005


On Mon, 14 Mar 2005 17:05:38 GMT
JUANVAQUEROPONC wrote:

> Matthias Andree wrote:
> > It caches the whole token count to be registered in RAM so it can sort
> > the tokens to achieve acceptable performance. We tried without and it
> > was slow like a snail.
> 
> Could the code that doesn't keep the tokens in memory be made available
> as an option for old machines (<=512MB of RAM)?
> I'd like to test that code to see how slow it is and convince myself
> (or just see that it works OK :-)
> Leaving all the caching to libdb (or sqlite) shouldn't change things.
> 
> Could anybody tell me how to get the old code that doesn't have the big
> token list in memory?

Juan,

Bogofilter only needs a lot of memory when you're registering a BIG
mailbox (or a directory with a LOT of messages).  My test with a 1008MB
mailbox used 325MB of RAM.  If you don't have such large mailboxes, you
won't need as much RAM.  Other than test cases, my mailboxes are less
than 100MB, so registering them doesn't need much RAM.

The "big" token list code (for registering mailboxes) has been in
bogofilter for approx 2 years now.  If you want to minimize RAM usage,
use formail to split the mailbox and register one message at a time,
as in:

   formail -s bogofilter -n < ham.mbx

It'll be significantly slower because of lots of database activity.
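
To illustrate the trade-off, here is a minimal Python sketch.  It is not
bogofilter's code: CountingStore is a hypothetical stand-in for the
token database, the function names are made up for this example, and
counting every occurrence of a token is a simplification.  It only shows
why collecting counts in memory first leads to fewer, sorted database
writes than updating the database after every message.

```python
from collections import Counter


class CountingStore:
    """Hypothetical stand-in for the token database; it counts writes."""

    def __init__(self):
        self.data = {}
        self.writes = 0

    def add(self, token, count):
        # One call here represents one database update.
        self.data[token] = self.data.get(token, 0) + count
        self.writes += 1


def register_batched(messages, store):
    """Accumulate all counts in RAM, then write each token once, sorted."""
    totals = Counter()
    for tokens in messages:
        totals.update(tokens)
    for token in sorted(totals):
        store.add(token, totals[token])


def register_per_message(messages, store):
    """Update the store after every message, as the formail pipeline does."""
    for tokens in messages:
        for token, count in Counter(tokens).items():
            store.add(token, count)


messages = [["free", "money", "free"], ["free", "offer"], ["money", "back"]]

batched = CountingStore()
register_batched(messages, batched)

per_message = CountingStore()
register_per_message(messages, per_message)

print(batched.writes, per_message.writes)
```

Both stores end up with the same counts; the batched version trades
memory (the whole Counter lives in RAM) for fewer writes.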

> > Do you have lots of random garbage (rather than regular words in
> > messages) in that mailbox? Note that attachments do not count,
> > bogofilter skips them.
> 
> I have all the spam in a folder, that includes Asian and Russian
> messages.
> 
> What's the actual key and value stored in the database?
> Is it explained somewhere?
> 
> If the key is just the token and the value is just the token count, we
> may be able to try using a mysql database to manage the tokens and 
> token count using a table structure like the sqlite backend.
> I haven't been able to compile the sqlite backend yet.

The database key is the token (which is currently limited to 35
bytes).  The value is normally 3 integers (12 bytes) with 4 bytes for
the spam count, 4 bytes for the ham count, and 4 bytes for the
timestamp.
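
As a rough illustration of that record layout, here is a Python sketch.
The byte order, the unsigned integer type, the timestamp encoding and
the helper names are assumptions made for the example, not a
description of bogofilter's on-disk format.

```python
import struct

MAX_TOKEN_BYTES = 35  # token length limit mentioned above

# Assumed layout: three unsigned 32-bit integers, native byte order,
# no padding, in the order spam count, ham count, timestamp.
VALUE_FORMAT = "=III"


def pack_record(token, spam_count, ham_count, timestamp):
    """Return (key, value) as bytes for one token."""
    key = token.encode("utf-8")[:MAX_TOKEN_BYTES]
    value = struct.pack(VALUE_FORMAT, spam_count, ham_count, timestamp)
    return key, value


def unpack_value(value):
    """Return (spam_count, ham_count, timestamp) from a 12-byte value."""
    return struct.unpack(VALUE_FORMAT, value)


# The timestamp here is an arbitrary integer chosen for illustration.
key, value = pack_record("example", 42, 1, 20050314)
print(len(key), len(value))
print(unpack_value(value))
```

With a layout like this, a table with one token column and three
integer columns would hold the same information.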

I'd expect the overhead of MySQL to be higher than that of bogofilter's
default database, Berkeley DB.  Let us know what sort of results you
achieve!

HTH,

David
