txn performance penalty

Matthias Andree matthias.andree at gmx.de
Wed Nov 17 01:59:18 CET 2004


David Relson <relson at osagesoftware.com> writes:

> I've spotted something odd in function lookup() in file score.c.  In
> outline form, the code is:
>
>     for (list=word_lists; list != NULL; list=list->next)
>     {
> 	ds_txn_begin(list->dsh);
> 	ret = ds_read(list->dsh, token, &val);
> 	ds_txn_commit(list->dsh);
>     }

Tracing this, I found a _severe_ flaw in the multiple-wordlist code: it
throws off our calculations whenever more than one wordlist is in use,
unless all of them have the same "override" value.

We're using a priority queue and summing over all lists at the same
"override" level as the first to have the token, no?


Anyways, we're calculating (b/g = bad/good count, mb/mg = bad/good
messages total; s/x are Robinson's values) - see prob.c:

      b/mb
p :=  -----------
      b/mb + g/mg


      s*x + (b+g)*p
f :=  -------------
      s   + (b+g)

What we're currently doing is summing up the individual lists' .MSG_COUNT
values to get mb/mg. We are _NOT_ summing up only the relevant .MSG_COUNT
values, but are using the sum over _all_ lists, so the b/mb and g/mg
ratios for a token are bogus whenever we skip any of the lists.

We _must_ eliminate the globals msgs_good/msgs_bad if we want
multiple_wordlist code to Do The Right Thing[tm], and pass individual
message good/bad counts depending on how many and which of the lists
we've looked at.
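
To illustrate how much the wrong totals can matter, here is a minimal
sketch of the two formulas above in C. The function names token_p() and
robinson_f() are made up for this example and are not the ones in prob.c;
the guards against zero denominators are likewise my own assumption.

```c
/* Hypothetical helpers for illustration only; not bogofilter's prob.c. */

/* p = (b/mb) / (b/mb + g/mg)
 * b, g   : bad/good counts for the token
 * mb, mg : bad/good message totals of the lists actually consulted */
double token_p(double b, double g, double mb, double mg)
{
    double pb = (mb > 0.0) ? b / mb : 0.0;
    double pg = (mg > 0.0) ? g / mg : 0.0;

    if (pb + pg == 0.0)
        return 0.0;   /* token unseen: p gets zero weight in robinson_f() */
    return pb / (pb + pg);
}

/* f = (s*x + (b+g)*p) / (s + (b+g)), with Robinson's s and x */
double robinson_f(double s, double x, double b, double g, double p)
{
    double n = b + g;

    if (s + n == 0.0)
        return x;     /* no prior weight and no data: fall back to x */
    return (s * x + n * p) / (s + n);
}
```

With made-up numbers: a token with b=8, g=2 in a list holding 100 bad
and 100 good messages gives token_p(8, 2, 100, 100) = 0.8. If a second,
skipped list with 100 bad and 10000 good messages is wrongly added to
the totals, token_p(8, 2, 200, 10100) comes out at roughly 0.995.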

The integral steps are to:

a- make sure that reading the tokens and the .MSG_COUNT happens in the
   same transaction, so we aren't looking at bogus data when a large
   registration hits us in the middle of scoring. (It was while checking
   where to move the ds_txn_() "brackets" that I saw that bogofilter
   goofs up the calculations.)

b- sum up "local" msgs_good and msgs_bad in score.c that contain only
   the counts of databases that were touched.

And possibly

c- think about whether we need to handle a "token not found" condition.
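
Steps a to c could look roughly like the sketch below. Everything in it
is a hypothetical stand-in: the types, txn_begin()/txn_commit(),
list_read() and lookup_token() are not bogofilter's real API, and each
"database" is a toy holding a single token. It also assumes that the
lists are sorted by "override" level and that every list at the level
where the token is first found counts as "touched", whether or not that
particular list has the token.

```c
#include <stddef.h>
#include <string.h>

/* All names below are hypothetical stand-ins, not bogofilter's API. */
typedef struct { unsigned long good, bad; } counts_t;

typedef struct wordlist {
    int override;            /* lists are assumed sorted by this level */
    counts_t msg_count;      /* this list's .MSG_COUNT */
    const char *token;       /* toy "database": a single token ... */
    counts_t token_count;    /* ... and its good/bad counts */
    struct wordlist *next;
} wordlist_t;

typedef struct {
    int found;               /* 0 means "token not found" (step c) */
    counts_t tok;            /* summed token counts */
    counts_t msgs;           /* "local" msgs_good/msgs_bad (step b) */
} lookup_result_t;

static int open_txns = 0;    /* lets a caller check the brackets balance */
static void txn_begin(wordlist_t *l)  { (void)l; open_txns++; }
static void txn_commit(wordlist_t *l) { (void)l; open_txns--; }

/* Returns 0 and fills *val if the list has the token, 1 otherwise. */
static int list_read(const wordlist_t *l, const char *token, counts_t *val)
{
    if (strcmp(l->token, token) != 0)
        return 1;
    *val = l->token_count;
    return 0;
}

lookup_result_t lookup_token(wordlist_t *lists, const char *token)
{
    const lookup_result_t none = { 0, {0, 0}, {0, 0} };
    lookup_result_t r = none;
    wordlist_t *l = lists;

    while (l != NULL) {
        int level = l->override;

        r = none;            /* start fresh for each override level */
        for (; l != NULL && l->override == level; l = l->next) {
            counts_t val;

            txn_begin(l);    /* step a: token and .MSG_COUNT in one txn */
            if (list_read(l, token, &val) == 0) {
                r.found = 1;
                r.tok.good += val.good;
                r.tok.bad  += val.bad;
            }
            r.msgs.good += l->msg_count.good;
            r.msgs.bad  += l->msg_count.bad;
            txn_commit(l);
        }
        if (r.found)
            return r;        /* lower-priority levels are never touched */
    }
    return none;             /* step c: caller decides what to do */
}
```

The caller would then feed r.tok and r.msgs, rather than the globals,
into the probability calculation. Whether lists without the token
should contribute their message counts is exactly the kind of question
that needs settling before this goes into score.c.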

I won't touch score.c before 09:00 UTC so if you want to take care of
the "bogus msg_count", go ahead. The message_count tokens must either be
looked up alongside the token, or passed in from the caller - I think
the ds_txn_begin and _commit will be moved up the call stack, and the
.MSG_COUNT per list would only have to be looked up at the beginning of
a transaction, not once per other token that we query. I'll handle
getting the ds_txn stuff into the right places later.

-- 
Matthias Andree


