ideas for greater bdb efficiency
David Relson
relson at osagesoftware.com
Fri Jan 24 00:21:12 CET 2003
At 12:32 PM 1/23/03, Matt Armstrong wrote:
>I notice that bogofilter uses 2 separate databases -- one for good and
>one for bad. I bet a high percentage of common words are present in
>both databases. Things could be more efficient (in space and time) if
>one database were used and each record in the database held both a
>spam and ham count, right? This need not limit the future potential
>for more than just spam and ham classifications.
>
>Also, I haven't verified that the code doesn't already do this, but
>greater efficiency could possibly be achieved by batching database
>accesses into chunks (say, 1000 or 5000 lookups each) and sorting the
>words before looking them up. This way, entries are looked up in
>order, taking advantage of the B-tree nature of the database
>(locality of reference and all that). Also, it is then trivial to
>eliminate duplicate lookups, since duplicates will be adjacent to
>each other.
Matt,
We always appreciate new ideas. I'd love to know the results of running
with a combined wordlist. Are you up for modifying the code and running
the test?
Bogofilter uses a routine called collect_words() to save the tokens from
the message. Part of its operation is to detect duplicates, so
bogofilter only has to hit each database once per token. I don't know
whether the words are in order or not. Another experiment would be to
check whether they are, and how much of a difference sorting makes. If
you can do that, it'd be another contribution.
Cheers!
David
More information about the bogofilter-dev mailing list