ideas for greater bdb efficiency
Gyepi SAM
gyepi at praxis-sw.com
Fri Jan 24 13:12:01 CET 2003
On Thu, Jan 23, 2003 at 06:21:12PM -0500, David Relson wrote:
> At 12:32 PM 1/23/03, Matt Armstrong wrote:
> >Also, I haven't verified the code doesn't do this, but greater
> >efficiency could possibly be achieved by batching database accesses
> >into chunks (say, 1000 or 5000 lookups each) and sorting the words
> >before looking them up. This way, entries are looked up in order and
> >the BTree nature of the database is taken care of (locality of
> >reference and all that). Also, it is then trivial to eliminate
> >duplicate lookups since they will be adjacent to each other.
> Bogofilter uses a routine called collect_words() to save the tokens from
> the message. Part of its operation is to detect duplicates. Thus
> bogofilter only has to hit each database once per token. I don't know
> whether the words are in order or not.
The words are stored in a wordhash, which keeps them in a hash table but also
threads them on a linked list for traversal, so iterating over a wordhash
returns the words in insertion order. It would certainly be possible to sort
the words first if desired.
-Gyepi