oddity.

David Relson relson at osagesoftware.com
Tue Apr 15 01:25:30 CEST 2003


At 05:59 PM 4/14/03, michael at optusnet.com.au wrote:


>I understand why the loop is there, but collect_words() ignores its
>'wordhash' input, and allocates a new hash every time it's called.
>It's returned by overwriting the 'wordhash' input.

wordhash is an output, not an input.  There is no value in wordhash before 
calling collect_words(), hence nothing to lose.

>So the loop above actually ignores everything except the
>words after the last 'From ' and leaks memory for all the
>blocks before that From.

What's there is correct, AFAIK.  When you have a counter-example to prove 
me wrong, send it.

>It sounds like you really want something like:
>
>         wordhash_t * wordhash = NULL;
>         long wordcount = 0;
>         ...
>         collect_reset();
>         do {
>                 wordhash_t * temp_hash;
>                 long temp_count;
>
>                 collect_words(&temp_hash, &temp_count, &cont);
>                 if (!wordhash) {
>                         wordhash = temp_hash;
>                         wordcount = temp_count;
>                 } else { /* Merge the new hash with the existing one */
>                         wordhash_sort(temp_hash);
>                         add_hash(wordhash, temp_hash);
>                         wordhash_free(temp_hash);
>                         wordcount += temp_count;
>                 }
>         } while(cont);
>         ...
>
>yes? (note: written by eye and not compiled; eat with care; do not machine
>wash; if swallowed, seek medical advice).

Michael,

I was preparing a long reply to tell you why you were wrong and then 
decided to test the actual behavior with a specially crafted message.  To 
my chagrin, I found that you are correct.  Whatever appears before a ^From 
line is lost.  At least I never sent the message I started to write.

The proper solution may be to create the wordhash before the call to 
collect_words() and let the function just add to it.  Then the higher level 
routine is responsible for allocation, management, and deallocation of 
wordhashes.  That's a better design than allocating at a lower level and 
deallocating at the higher level.  I need to do some experimentation to 
determine exactly what needs to be done and to make sure speed doesn't suffer.
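
For illustration only, here is a minimal sketch of that caller-owned design. 
Everything in it is hypothetical: the toy_* names and the linked-list "hash" 
are stand-ins invented for this example and are not bogofilter's wordhash 
API. The point is only the ownership pattern: the caller allocates one 
container, the collect step adds to it for every block, and the caller frees 
it once at the end, so words seen before a block boundary are kept.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a word hash: a linked list of (word, count). */
typedef struct toy_node {
    char *word;
    long count;
    struct toy_node *next;
} toy_node;

typedef struct {
    toy_node *head;
} toy_hash;

/* Allocation happens at the higher level, once. */
toy_hash *toy_hash_new(void)
{
    toy_hash *h = calloc(1, sizeof *h);
    if (h == NULL)
        abort();
    return h;
}

/* Deallocation also happens at the higher level, once. */
void toy_hash_free(toy_hash *h)
{
    toy_node *n = h->head;
    while (n != NULL) {
        toy_node *next = n->next;
        free(n->word);
        free(n);
        n = next;
    }
    free(h);
}

/* Increment the count for a word, creating its node on first sight. */
void toy_hash_add(toy_hash *h, const char *word, size_t len)
{
    toy_node *n;
    for (n = h->head; n != NULL; n = n->next) {
        if (strlen(n->word) == len && memcmp(n->word, word, len) == 0) {
            n->count++;
            return;
        }
    }
    n = malloc(sizeof *n);
    if (n == NULL || (n->word = malloc(len + 1)) == NULL)
        abort();
    memcpy(n->word, word, len);
    n->word[len] = '\0';
    n->count = 1;
    n->next = h->head;
    h->head = n;
}

/* Return how many times a word was added, or 0 if never seen. */
long toy_hash_count(const toy_hash *h, const char *word)
{
    const toy_node *n;
    for (n = h->head; n != NULL; n = n->next)
        if (strcmp(n->word, word) == 0)
            return n->count;
    return 0;
}

/* Hypothetical collect step: tokenizes one block on spaces and newlines
 * and ADDS to the caller's hash.  It never allocates or replaces the hash. */
void toy_collect_words(toy_hash *h, const char *text)
{
    const char *p = text;
    assert(h != NULL && text != NULL);
    while (*p != '\0') {
        size_t len;
        p += strspn(p, " \n");      /* skip separators */
        len = strcspn(p, " \n");    /* length of the next token */
        if (len > 0)
            toy_hash_add(h, p, len);
        p += len;
    }
}

/* Higher-level routine: one hash for all blocks, freed exactly once. */
long toy_count_across_blocks(const char *const *blocks, size_t nblocks,
                             const char *word)
{
    toy_hash *h = toy_hash_new();
    long result;
    size_t i;
    for (i = 0; i < nblocks; i++)
        toy_collect_words(h, blocks[i]);
    result = toy_hash_count(h, word);
    toy_hash_free(h);
    return result;
}
```

With two blocks "alpha beta" and "beta gamma", toy_count_across_blocks() 
reports 2 for "beta" and 1 for "alpha": the first block's words survive, and 
nothing is allocated per block that the caller has to remember to free.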

David




