wordhashes [was: time test]
Matthias Andree
matthias.andree at gmx.de
Mon Nov 25 20:31:04 CET 2002
On Mon, 25 Nov 2002, David Relson wrote:
> What isn't said explicitly, and I infer to be true, is that wordhash is
> keeping track of all words to minimize usage of the BerkeleyDB. Is this
> correct? Has this caching/optimization been measured and shown to be a
> speed win?
Sort of. When I changed the structure to speed up the wordhash use, and
once my collect_words returned after each message, the first version of
register looked like this:
    do {
        collect_words(&h ... &cont);
        register_words(...h, 1...);
    } while (cont);
It was awfully slow: run time degraded from 23 s to something above 140 s
(I killed it after 140 s), and it had a considerable amount of system time
(probably because we sync everywhere, but never mind).
So the wordhash is currently responsible for a big speed increase.
Now we have (register_messages, not yet in CVS, see my patch):
    do {
        collect_words(&h, &wordcount, &cont);
        add_hash(words, h);
        wordhash_free(h);
        msgcount++;
    } while (cont);
    register_words(_run_type, words, msgcount, wordcount);
    wordhash_free(words);
and we're down to 7.5 s. That saves a good two thirds of the former
execution time (23 s).
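
To illustrate why merging first helps, here is a minimal, self-contained
sketch. It is not the bogofilter code: table_t, table_add, table_merge,
fake_db_update and the demo messages are all hypothetical stand-ins, and
the "database" only counts how many records would be updated. It compares
updating after every message with merging the per-message counts and
updating once at the end.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_WORDS 64
#define NUM_MSGS 3

/* Hypothetical stand-in for a wordhash: a small word -> count table. */
typedef struct {
    const char *word[MAX_WORDS];
    int count[MAX_WORDS];
    int used;
} table_t;

/* Add n occurrences of w, creating the entry if the word is new. */
void table_add(table_t *t, const char *w, int n)
{
    for (int i = 0; i < t->used; i++) {
        if (strcmp(t->word[i], w) == 0) {
            t->count[i] += n;
            return;
        }
    }
    assert(t->used < MAX_WORDS);
    t->word[t->used] = w;
    t->count[t->used] = n;
    t->used++;
}

/* Return the count stored for w, or 0 if w is not in the table. */
int table_get(const table_t *t, const char *w)
{
    for (int i = 0; i < t->used; i++) {
        if (strcmp(t->word[i], w) == 0)
            return t->count[i];
    }
    return 0;
}

/* Fold the per-message table src into the cumulative table dst. */
void table_merge(table_t *dst, const table_t *src)
{
    for (int i = 0; i < src->used; i++)
        table_add(dst, src->word[i], src->count[i]);
}

/* Simulated database: only counts how many records would be updated. */
int db_writes;

void fake_db_update(const table_t *t)
{
    db_writes += t->used;   /* one record update per distinct word */
}

/* Made-up input: three messages, each a NULL-terminated word list. */
const char *const demo_msgs[NUM_MSGS][4] = {
    { "free", "money", "now", NULL },
    { "free", "offer", NULL },
    { "money", "free", NULL },
};

/* Count the words of one message into t. */
void count_message(table_t *t, const char *const *words)
{
    for (; *words != NULL; words++)
        table_add(t, *words, 1);
}

/* Variant 1: update the database after every single message. */
int writes_per_message(void)
{
    db_writes = 0;
    for (int m = 0; m < NUM_MSGS; m++) {
        table_t h;
        memset(&h, 0, sizeof h);
        count_message(&h, demo_msgs[m]);
        fake_db_update(&h);
    }
    return db_writes;
}

/* Variant 2: merge every message into one table, update once at the end. */
int writes_batched(table_t *total)
{
    db_writes = 0;
    memset(total, 0, sizeof *total);
    for (int m = 0; m < NUM_MSGS; m++) {
        table_t h;
        memset(&h, 0, sizeof h);
        count_message(&h, demo_msgs[m]);
        table_merge(total, &h);
    }
    fake_db_update(total);
    return db_writes;
}
```

With these three toy messages, writes_per_message() touches 7 records and
writes_batched() touches 4, because words repeated across messages are
updated only once. The real saving depends on how much the messages
overlap and on what each database update costs.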
> 1) collect_words and determine their counts (current).
> 2)at the end of the message, make a pass over the list adding current count
> to cumulative count and clearing current count.
> .. repeat 1 & 2 for each message
> 3) at end, use cumulative counts to update database.
The code is there, because I had exactly the same idea. See my other
recent mails.
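
For readers without the patch at hand, the three quoted steps can be
sketched with a single table that keeps two counters per word. This is
only an illustration under assumed names (entry_t, counts_t, see_word,
end_message and cumulative_count are hypothetical), not what the patch
actually does, which merges a per-message hash into a second one.

```c
#include <assert.h>
#include <string.h>

#define MAX_ENTRIES 64

/* Hypothetical entry: one word with a per-message and a running count. */
typedef struct {
    const char *word;
    int current;      /* occurrences in the message being read */
    int cumulative;   /* occurrences in all finished messages */
} entry_t;

typedef struct {
    entry_t e[MAX_ENTRIES];
    int used;
} counts_t;

/* Step 1: count one occurrence of w in the current message. */
void see_word(counts_t *c, const char *w)
{
    for (int i = 0; i < c->used; i++) {
        if (strcmp(c->e[i].word, w) == 0) {
            c->e[i].current++;
            return;
        }
    }
    assert(c->used < MAX_ENTRIES);
    c->e[c->used].word = w;
    c->e[c->used].current = 1;
    c->e[c->used].cumulative = 0;
    c->used++;
}

/* Step 2: at the end of a message, fold current into cumulative. */
void end_message(counts_t *c)
{
    for (int i = 0; i < c->used; i++) {
        c->e[i].cumulative += c->e[i].current;
        c->e[i].current = 0;
    }
}

/* Step 3 would walk the table and write each cumulative count to the
 * database; this lookup stands in for that final pass. */
int cumulative_count(const counts_t *c, const char *w)
{
    for (int i = 0; i < c->used; i++) {
        if (strcmp(c->e[i].word, w) == 0)
            return c->e[i].cumulative;
    }
    return 0;
}
```

A caller would zero a counts_t, call see_word() for each token, call
end_message() at every message boundary, and read the cumulative counts
only once all messages have been processed.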
> A minor optimization would be to do step 2 at the _beginning_ of the 2nd,
> 3rd, etc messages. Step 3 would then use "current plus cumulative count"
> when updating the database.
I don't believe it's worth it. The former suggestion is simpler to implement.
Matthias