wordhashes [was: time test]

David Relson relson at osagesoftware.com
Mon Nov 25 19:48:54 CET 2002


At 12:32 AM 11/25/02, Gyepi SAM wrote:

>wordhash_next returns the next node of a linked list. When a word is added 
>to the hash,
>it is also added to the end of a linked list of all words seen. 
>wordhash_first() and wordhash_next()
>are then used to walk down the list. This is an O(n) operation.
>
>Keep in mind that if you call bogofilter on an mbox file with multiple emails,
>the
>         for(n = wordhash_first(h); n != NULL; n = wordhash_next(h)){...}
>
>loop will be called at the end of each message since we need to cap the 
>per message frequencies of each word.
>Upon each start of the loop, the number of words on the list increases by 
>the number of words in the last message.
>For a mailbox with m messages, each containing w words, there will be n * 
>w words on the list on the nth start of the loop.
>So the words from the first message are 'seen' [1] n times, the second 
>(n-1) times ... in effect, we look at each 'message' [1] (n^2 + n)/2 times.

What isn't said explicitly, and I infer to be true, is that wordhash is 
keeping track of all words to minimize usage of the BerkeleyDB.  Is this 
correct?  Has this caching/optimization been measured and shown to be a
speed win?


>The only alternative to walking the list at the end of every message is to 
>associate, with each word, a list of counts
>for each message it has appeared in. Then at the end of the whole message 
>processing loop, we walk the word list,
>and for each word, walk the message count list,  cap each count, add it to 
>that word's sum, then at the end,
>update the word's total count. This would be considerably faster since we 
>are guaranteed to only look at a word however
>many times it appears, plus one more time when we walk the entire list.

Let me suggest an alternative.  For each word, keep track of two
counts: the first is for the current message; the second is the
cumulative count for all prior messages.  The processing then
becomes -

1) collect words and determine their counts (current).
2) at the end of the message, make a pass over the list, adding each
current count to its cumulative count and clearing the current count.
... repeat 1 & 2 for each message
3) at the end, use the cumulative counts to update the database.

This would give a known entry size for each word in the hash at the cost of 
an extra (linear) pass over the word list for each message.

A minor optimization would be to do step 2 at the _beginning_ of the 2nd,
3rd, etc. messages.  Step 3 would then use "current plus cumulative count"
when updating the database.

>Given, however, that most invocations of bogofilter will be to process a 
>single email, it may not be worth the effort
>to implement this optimization.
>
>[1] We 'look' at each message n times in the sense that we see the words 
>in the message n times, but not necessarily in the same order that they 
>occur in the message itself.
>
>-Gyepi

More information about the bogofilter-dev mailing list