ideas for greater bdb efficiency
matt at lickey.com
Fri Jan 24 00:51:53 EST 2003
David Relson <relson at osagesoftware.com> writes:
> At 12:32 PM 1/23/03, Matt Armstrong wrote:
>>I notice that bogofilter uses 2 separate databases -- one for good and
>>one for bad. I bet a high percentage of common words are present in
>>both databases. Things could be more efficient (in space and time) if
>>one database were used and each record in the database held both a
>>spam and ham count, right? This need not preclude adding more
>>classifications than just spam and ham in the future.
>>Also, I haven't verified the code doesn't do this, but greater
>>efficiency could possibly be achieved by batching database accesses
>>into chunks (say, 1000 or 5000 lookups each) and sorting the words
>>before looking them up. This way, entries are looked up in order and
>>the BTree nature of the database is taken advantage of (locality of
>>reference and all that). Also, it is then trivial to eliminate
>>duplicate lookups since they will be adjacent to each other.
> We always appreciate new ideas. I'd love to know the results of
> running with a combined wordlist. Are you up for modifying the code
> and running the test?
> Bogofilter uses a routine called collect_words() to save the tokens
> from the message. Part of its operation is to detect duplications.
> Thus bogofilter only has to hit each database once for each token. I
> don't know whether the words are in order or not. Another experiment
> would be to check whether they are, and how much of a difference
> ordering makes. If you can do that, it'd be another contribution.
Probably eventually. This has been in my head for a while and I
wanted to get the idea out there.