ideas for greater bdb efficiency
David Relson
relson at osagesoftware.com
Fri Jan 24 00:21:12 CET 2003
At 12:32 PM 1/23/03, Matt Armstrong wrote:
>I notice that bogofilter uses 2 separate databases -- one for good and
>one for bad. I bet a high percentage of common words are present in
>both databases. Things could be more efficient (in space and time) if
>one database were used and each record in the database held both a
>spam and ham count, right? This need not limit the future potential
>for more than just spam and ham classifications.
>
>Also, I haven't verified that the code doesn't already do this, but
>greater efficiency could possibly be achieved by batching database
>accesses into chunks (say, 1000 or 5000 lookups each) and sorting the
>words before looking them up. This way, entries are looked up in
>order, taking advantage of the B-tree nature of the database
>(locality of reference and all that). Also, it is then trivial to
>eliminate duplicate lookups, since duplicates will be adjacent to
>each other.
Matt,
We always appreciate new ideas. I'd love to know the results of running
with a combined wordlist. Are you up for modifying the code and running
the test?
Bogofilter uses a routine called collect_words() to save the tokens from
the message. Part of its operation is to detect duplicates, so
bogofilter only has to hit each database once per token. I don't know
whether the words are in order or not. Another experiment would be to
check whether they are, and how much of a difference sorting makes. If
you can do that, it'd be another contribution.
Cheers!
David
More information about the bogofilter-dev mailing list