ideas for greater bdb efficiency

Gyepi SAM gyepi at praxis-sw.com
Fri Jan 24 13:12:01 CET 2003


On Thu, Jan 23, 2003 at 06:21:12PM -0500, David Relson wrote:
> At 12:32 PM 1/23/03, Matt Armstrong wrote:
> >Also, I haven't verified the code doesn't do this, but greater
> >efficiency could possibly be achieved by batching database accesses
> >into chunks (say, 1000 or 5000 lookups each) and sorting the words
> >before looking them up.  This way, entries are looked up in order and
> >the BTree nature of the database is taken care of (locality of
> >reference and all that).  Also, it is then trivial to eliminate
> >duplicate lookups since they will be adjacent to each other.
> Bogofilter uses a routine called collect_words() to save the tokens from 
> the message.  Part of its operation is to detect duplications.  Thus 
> bogofilter only has to hit each database once for each token.  I don't know 
> whether the words are in order or not.

The words are stored in a wordhash: a hash table whose entries are also
threaded onto a linked list for traversal, so iterating over a wordhash
returns the words in insertion order.  It would certainly be possible to
sort the words before lookup if desired.

-Gyepi




More information about the bogofilter-dev mailing list