wordhashes [was: time test]

Gyepi SAM gyepi at praxis-sw.com
Mon Nov 25 20:26:10 CET 2002


On Mon, Nov 25, 2002 at 01:48:54PM -0500, David Relson wrote:
> At 12:32 AM 11/25/02, Gyepi SAM wrote:
> >the
> >        for(n = wordhash_first(h); n != NULL; n = wordhash_next(h)){...}
> >
> >loop will be called at the end of each message since we need to cap the 
> >per message frequencies of each word.
> >Upon each start of the loop, the number of words on the list increases by 
> >the number of words in the last message.
> >For a mailbox with m messages, each containing w words, there will be n * 
> >w words on the list on the nth start of the loop.
> >So the words from the first message are 'seen' [1] n times, the second 
> >(n-1) times ... in effect, we look at each 'message' [1] (n^2 + n)/2 times.
> 
> What isn't said explicitly, and I infer to be true, is that wordhash is 
> keeping track of all words to minimize usage of the BerkeleyDB.  Is this 
> correct?  Has this caching/optimization been measured and shown to be a 
> speed win?

The inference is partially true, but the decision was not made for that reason.
wordhash keeps track of all the words because it has to:
collect_words() is used for every mode.
We could use BerkeleyDB more and, for example, store each message's words into
the database before moving on to the next message, but that only makes sense when
registering words, and even then, only for the -s or -n modes.

The -N and -S options, in addition to regular analysis mode, would complicate matters
quite a bit. I'd rather not go there.

> >The only alternative to walking the list at the end of every message is to 
> >associate, with each word, a list of counts
> >for each message it has appeared in. Then at the end of the whole message 
> >processing loop, we walk the word list,
> >and for each word, walk the message count list,  cap each count, add it to 
> >that word's sum, then at the end,
> >update the word's total count. This would be considerably faster since we 
> >are guaranteed to only look at a word however
> >many times it appears, plus one more time when we walk the entire list.
> 
> Let me suggest an alternative.  For each word keep track of two 
> counts.  The first one is for the current message.  The
> second is the cumulative count for all prior messages.  The processing then 
> becomes -
> 
> 1)collect_words and determine their counts (current).
> 2)at the end of the message, make a pass over the list adding current count 
> to cumulative count and clearing current count.
> ... repeat 1 & 2 for each message
> 3) at end, use cumulative counts to update database.

This is exactly what we do now. We need to avoid doing step 2 for every message
and instead do it once, at the end of message processing.

-Gyepi
