ideas for greater bdb efficiency

Matt Armstrong matt at lickey.com
Thu Jan 23 18:32:14 CET 2003


I notice that bogofilter uses two separate databases -- one for good
words and one for bad.  I bet a high percentage of common words are
present in both databases.  Things could be more efficient (in both
space and time) if a single database were used and each record held
both a spam and a ham count, right?  This need not limit the future
potential for more than just spam and ham classifications.
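As a rough sketch, a combined record might look something like this
(the struct and helper names here are my own invention, not
bogofilter's actual format):

```c
#include <string.h>

/* Hypothetical combined record: one value per word instead of two
 * databases, so a word common to spam and ham costs one key, one
 * lookup, and one BTree page touch instead of two. */
struct wordcount {
    unsigned long spam;  /* occurrences in spam messages */
    unsigned long ham;   /* occurrences in ham messages  */
};

/* Serialize the record into the opaque byte string that would be
 * stored as the database value. */
void wc_pack(const struct wordcount *wc, unsigned char *buf)
{
    memcpy(buf, wc, sizeof *wc);
}

/* Deserialize a database value back into a record. */
void wc_unpack(struct wordcount *wc, const unsigned char *buf)
{
    memcpy(wc, buf, sizeof *wc);
}
```

Adding a third classification later would just mean widening the
record, not adding another database.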

Also, I haven't verified that the code doesn't already do this, but
greater efficiency could possibly be achieved by batching database
accesses into chunks (say, 1000 or 5000 lookups each) and sorting the
words before looking them up.  That way, entries are looked up in key
order and the BTree nature of the database is taken advantage of
(locality of reference and all that).  It also becomes trivial to
eliminate duplicate lookups, since duplicates end up adjacent to each
other after sorting.
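The batch preparation step could be as simple as the following sketch
(`prepare_batch` and `cmp_word` are hypothetical names, not functions
in bogofilter):

```c
#include <stdlib.h>
#include <string.h>

/* qsort comparator for an array of C strings. */
int cmp_word(const void *a, const void *b)
{
    return strcmp(*(char *const *)a, *(char *const *)b);
}

/* Sort a batch of words and drop adjacent duplicates in place,
 * returning the new count.  After this, lookups walk the BTree in
 * key order, and each unique word is looked up exactly once. */
size_t prepare_batch(char **words, size_t n)
{
    size_t i, out;

    if (n == 0)
        return 0;
    qsort(words, n, sizeof *words, cmp_word);
    out = 1;
    for (i = 1; i < n; i++)
        if (strcmp(words[i], words[out - 1]) != 0)
            words[out++] = words[i];
    return out;
}
```

The lookup loop would then just walk the deduplicated array front to
back, issuing in-order gets against the BTree.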

More information about the bogofilter-dev mailing list