wordhashes [was: time test]

Matthias Andree matthias.andree at gmx.de
Mon Nov 25 20:31:04 CET 2002


On Mon, 25 Nov 2002, David Relson wrote:

> What isn't said explicitly, and I infer to be true, is that wordhash is 
> keeping track of all words to minimize usage of the BerkeleyDB.  Is this 
> correct?  Has this caching/optimization been measured and shown to be a 
> speed win?

Sort of. When I restructured the code to speed up wordhash use, and once
my collect_words returned after each message, the first version of
register looked like this:

do {
  collect_words(&h ... &cont);
  register_words(...h, 1...);
} while(cont);

It was awfully slow: run time degraded from 23 s to more than 140 s (I
killed it after 140 s), and it used a considerable amount of system time
(probably because we sync everywhere, but never mind).

So the wordhash is currently responsible for a big speed increase.

Now we have (register_messages, not yet in CVS, see my patch):

  do {
      collect_words(&h, &wordcount, &cont);
      add_hash(words, h);
      wordhash_free(h);
      msgcount++;
  } while(cont);

  register_words(_run_type, words, msgcount, wordcount);
  wordhash_free(words);

and we're down to 7.5 s. That saves a good two thirds of the former execution time.
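
To illustrate the accumulation step, here is a minimal sketch of merging
per-message word counts into one cumulative table, so that the database
would only need to be touched once per word at the end. The toyhash_*
names and the fixed-size chained table are assumptions made up for this
example; they are not bogofilter's wordhash API, and toyhash_merge only
plays the role that add_hash has in the loop above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a word hash: word -> count, chained buckets. */
#define NBUCKETS 257

typedef struct entry {
    char *word;
    unsigned long count;
    struct entry *next;
} entry_t;

typedef struct {
    entry_t *bucket[NBUCKETS];
} toyhash_t;

static unsigned hash_str(const char *s)
{
    unsigned h = 5381;
    while (*s)
        h = h * 33u + (unsigned char)*s++;
    return h % NBUCKETS;
}

toyhash_t *toyhash_new(void)
{
    int b;
    toyhash_t *h = malloc(sizeof *h);
    assert(h != NULL);
    for (b = 0; b < NBUCKETS; b++)
        h->bucket[b] = NULL;
    return h;
}

/* Add n to the count of word, creating the entry if it is missing. */
void toyhash_add(toyhash_t *h, const char *word, unsigned long n)
{
    unsigned b = hash_str(word);
    entry_t *e;
    for (e = h->bucket[b]; e != NULL; e = e->next) {
        if (strcmp(e->word, word) == 0) {
            e->count += n;
            return;
        }
    }
    e = malloc(sizeof *e);
    assert(e != NULL);
    e->word = malloc(strlen(word) + 1);
    assert(e->word != NULL);
    strcpy(e->word, word);
    e->count = n;
    e->next = h->bucket[b];
    h->bucket[b] = e;
}

/* Return the count stored for word, or 0 if it was never added. */
unsigned long toyhash_get(const toyhash_t *h, const char *word)
{
    const entry_t *e;
    for (e = h->bucket[hash_str(word)]; e != NULL; e = e->next)
        if (strcmp(e->word, word) == 0)
            return e->count;
    return 0;
}

/* Merge every count of src (one message) into dest (all messages). */
void toyhash_merge(toyhash_t *dest, const toyhash_t *src)
{
    int b;
    const entry_t *e;
    for (b = 0; b < NBUCKETS; b++)
        for (e = src->bucket[b]; e != NULL; e = e->next)
            toyhash_add(dest, e->word, e->count);
}

void toyhash_free(toyhash_t *h)
{
    int b;
    for (b = 0; b < NBUCKETS; b++) {
        entry_t *e = h->bucket[b];
        while (e != NULL) {
            entry_t *next = e->next;
            free(e->word);
            free(e);
            e = next;
        }
    }
    free(h);
}
```

In this sketch a caller would build one toyhash_t per message, merge it
into the cumulative table, free it, and walk the cumulative table once
after the last message.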

> 1)collect_words and determine their counts (current).
> 2)at the end of the message, make a pass over the list adding current count 
> to cumulative count and clearing current count.
> .. repeat 1 & 2 for each message
> 3) at end, use cumulative counts to update database.

The code is there, because I had exactly the same idea. See my other
recent mails.

> A minor optimization would be to do step 2 at the _beginning_ of the 2nd, 
> 3rd, etc messages.  Step 3 would then use "current plus cumulative count" 
> when updating the database.

I don't believe it's worth it. The former suggestion is simpler to implement.
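
For comparison, the quoted three-step scheme with a current and a
cumulative count per word could be sketched roughly as below. This is
only an illustration under assumptions: wtable_t and the wtable_*
functions are hypothetical names, and the fixed-size array with linear
search stands in for a real hash table. wtable_total returns current
plus cumulative, so it gives the same answer whether or not the final
fold has been done.

```c
#include <assert.h>
#include <string.h>

#define MAXWORDS 1024   /* arbitrary limits for the sketch */
#define MAXLEN   32

typedef struct {
    char word[MAXLEN];
    unsigned long current;     /* occurrences in the message being read */
    unsigned long cumulative;  /* occurrences in all folded messages */
} wcount_t;

typedef struct {
    wcount_t w[MAXWORDS];
    size_t n;
} wtable_t;

void wtable_init(wtable_t *t)
{
    t->n = 0;
}

/* Step 1: count one occurrence of word in the current message. */
void wtable_see(wtable_t *t, const char *word)
{
    size_t i;
    for (i = 0; i < t->n; i++) {
        if (strcmp(t->w[i].word, word) == 0) {
            t->w[i].current++;
            return;
        }
    }
    assert(t->n < MAXWORDS && strlen(word) < MAXLEN);
    strcpy(t->w[t->n].word, word);
    t->w[t->n].current = 1;
    t->w[t->n].cumulative = 0;
    t->n++;
}

/* Step 2: fold current counts into cumulative counts and clear them. */
void wtable_end_message(wtable_t *t)
{
    size_t i;
    for (i = 0; i < t->n; i++) {
        t->w[i].cumulative += t->w[i].current;
        t->w[i].current = 0;
    }
}

/* Step 3 helper: the value that would go to the database for word. */
unsigned long wtable_total(const wtable_t *t, const char *word)
{
    size_t i;
    for (i = 0; i < t->n; i++)
        if (strcmp(t->w[i].word, word) == 0)
            return t->w[i].cumulative + t->w[i].current;
    return 0;
}
```

Compared with merging a separate per-message hash, this variant keeps a
single table but has to pass over every known word at each message
boundary.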

Matthias


