time test

Gyepi SAM gyepi at praxis-sw.com
Mon Nov 25 06:32:06 CET 2002


On Mon, Nov 25, 2002 at 04:38:04AM +0100, Matthias Andree wrote:

> And now for the real good stuff, the gprof output of bogofilter. Most of
> the time is spent in collect_words and wordhash_next (as called from
> collect_words). The input file has c. 587,000 lines and 2.77 million
> words. The sheer number of wordhash_next calls scares me. we have
> approx. 70 times the number of tokens as call amount. Is the algorithm
> right? I'd thought hashes are REALLY fast. (as in O(1)). Even O(n log n)
> would require a scale of 5 for the calls; n log n is 28
> million. Collision issue? Or my ignorance of how wordhash_next works?
> -------------------------------------------------------------------------
> Flat profile:
> 
> Each sample counts as 0.01 seconds.
>   %   cumulative   self              self     total           
>  time   seconds   seconds    calls   s/call   s/call  name    
>  58.62     19.34    19.34        1    19.34    32.89  collect_words
>  27.96     28.57     9.22 138833114     0.00     0.00  wordhash_next
...
>   0.65     32.72     0.21     3241     0.00     0.00  wordhash_first

wordhash_next returns the next node of a linked list. When a word is added to the hash,
it is also added to the end of a linked list of all words seen. wordhash_first() and wordhash_next()
are then used to walk down the list. This is an O(n) operation.

Keep in mind that if you call bogofilter on an mbox file with multiple emails,
the
	for(n = wordhash_first(h); n != NULL; n = wordhash_next(h)){...}

loop will run at the end of each message, since we need to cap the per-message frequency of each word.
At each start of the loop, the number of words on the list has grown by the number of words in the previous message.
For a mailbox with m messages, each containing w words, there will be n * w words on the list at the nth start of the loop.
So the words from the first message are 'seen' [1] m times, those from the second (m - 1) times, and so on; in effect, we look at (m^2 + m)/2 'messages' [1] over the whole run.

The only alternative to walking the list at the end of every message is to associate, with each word, a list of counts,
one for each message it has appeared in. Then, at the end of the whole mailbox processing loop, we walk the word list once;
for each word, we walk its message count list, cap each count, and add it to that word's sum, updating the
word's total count at the end. This would be considerably faster, since each word is looked at only as many
times as it appears, plus once more during the final walk of the list.

Given, however, that most invocations of bogofilter will be to process a single email, it may not be worth the effort
to implement this optimization.

[1] We 'look' at each message n times in the sense that we see the words in the message n times, but not necessarily in the same order that they occur in the message itself.

-Gyepi






More information about the Bogofilter mailing list