ideas for greater bdb efficiency

Matt Armstrong matt at lickey.com
Fri Jan 24 06:51:53 CET 2003


David Relson <relson at osagesoftware.com> writes:

> At 12:32 PM 1/23/03, Matt Armstrong wrote:
>
>>I notice that bogofilter uses 2 separate databases -- one for good and
>>one for bad.  I bet a high percentage of common words are present in
>>both databases.  Things could be more efficient (in space and time) if
>>a single database were used and each record held both a spam count and
>>a ham count, right?  This need not rule out adding classifications
>>beyond spam and ham in the future.
>>
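A combined record could be as simple as a two-counter struct stored as
the value for each token, so one lookup retrieves both counts.  A
minimal sketch, assuming a layout of my own invention -- the field
names and packing are illustrative, not bogofilter's actual on-disk
format:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical merged wordlist record: one database entry per token,
 * carrying both counts.  Illustrative only. */
struct wordcount {
    uint32_t spam;   /* occurrences in spam messages */
    uint32_t good;   /* occurrences in ham (good) messages */
};

/* Pack a record the way it would be handed to the database as the
 * data half of a key/value pair; a single get then yields both
 * counts at once.  Returns the number of bytes written. */
size_t pack_record(const struct wordcount *wc, unsigned char *buf)
{
    memcpy(buf, wc, sizeof *wc);
    return sizeof *wc;
}
```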
>>Also, I haven't verified that the code doesn't already do this, but
>>greater efficiency could possibly be achieved by batching database
>>accesses into chunks (say, 1000 or 5000 lookups each) and sorting the
>>words before looking them up.  That way, entries are looked up in key
>>order and the B-tree nature of the database is exploited (locality of
>>reference and all that).  It also becomes trivial to eliminate
>>duplicate lookups, since they end up adjacent to each other.
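To make the batching idea concrete, here is a sketch: sort the batch,
then skip adjacent duplicates while walking it in key order.  This is
not bogofilter code -- sort_and_dedup() is a name I made up, and the
actual database call would happen once per token it emits:

```c
#include <stdlib.h>
#include <string.h>

/* Compare two (const char *) tokens for qsort(). */
static int cmp_token(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Sort a batch of tokens so B-tree lookups proceed in key order and
 * duplicate tokens land next to each other.  Writes the unique tokens
 * into out[] in sorted order and returns how many there are; each of
 * those would then get exactly one database lookup. */
size_t sort_and_dedup(const char **tokens, size_t n, const char **out)
{
    size_t unique = 0;

    qsort((void *)tokens, n, sizeof *tokens, cmp_token);
    for (size_t i = 0; i < n; i++) {
        if (i > 0 && strcmp(tokens[i], tokens[i - 1]) == 0)
            continue;               /* duplicate: already scheduled */
        out[unique++] = tokens[i];  /* one lookup per unique token */
    }
    return unique;
}
```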
>
> Matt,
>
> We always appreciate new ideas.  I'd love to know the results of
> running with a combined wordlist.  Are you up for modifying the code
> and running the test?
>
> Bogofilter uses a routine called collect_words() to save the tokens
> from the message.  Part of its operation is to detect duplicates, so
> bogofilter only has to hit each database once per token.  I don't
> know whether the words come out in sorted order.  Another experiment
> would be to check whether they do and how much of a difference it
> makes.  If you can do that, it'd be another contribution.

Probably eventually.  This has been in my head for a while and I
wanted to get the idea out there.
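For whoever gets to the ordering experiment first: one cheap way to run
it would be to drop a check like this into the lookup path and see
whether it ever reports 0.  This is a hypothetical helper, not existing
bogofilter code:

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if the token stream is already in strcmp() order -- the
 * order a B-tree traversal would want -- and 0 otherwise.  Answers
 * the question of whether collect_words() happens to emit tokens
 * sorted, without changing any behavior. */
int tokens_sorted(const char **tokens, size_t n)
{
    for (size_t i = 1; i < n; i++)
        if (strcmp(tokens[i - 1], tokens[i]) > 0)
            return 0;
    return 1;
}
```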




More information about the bogofilter-dev mailing list