txn performance penalty
Matthias Andree
matthias.andree at gmx.de
Wed Nov 17 01:59:18 CET 2004
David Relson <relson at osagesoftware.com> writes:
> I've spotted something odd in function lookup() in file score.c. In
> outline form, the code is:
>
> for (list=word_lists; list != NULL; list=list->next)
> {
>     ds_txn_begin(list->dsh)
>     ret = ds_read(list->dsh, token, &val);
>     ds_txn_commit(list->dsh)
> }
Tracing this, I found a _severe_ flaw in the multiple-wordlist code that
throws off our calculations whenever more than one wordlist is in use,
unless all of them have the same "override" value.
We're using a priority queue and summing over all lists at the same
"override" level as the first to have the token, no?
Anyways, we're calculating (b/g = bad/good count, mb/mg = bad/good
messages total; s/x are Robinson's values) - see prob.c:
            b/mb
  p := -----------
       b/mb + g/mg

       s*x + (b+g)*p
  f := -------------
         s + (b+g)
What we're currently doing is summing up the individual lists' .MSG_COUNT
values to get mb/mg. We are _NOT_ summing up only the relevant
.MSG_COUNT values, but are using the sum over _all_ lists, so the token's
b/mb and g/mg ratios are bogus whenever we're skipping any of the lists.
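To make the effect concrete, here is a minimal sketch of the two formulas
above as a C function. The name token_prob() and its signature are made up
for illustration and are not the actual prob.c interface; the early return
for b+g == 0 is my own guard against 0/0 (f reduces to x in that case when
s > 0).

```c
#include <assert.h>

/* Illustrative only: not the real prob.c code.
 * b, g   : bad/good counts for the token
 * mb, mg : bad/good message totals
 * s, x   : Robinson's parameters */
double token_prob(double b, double g, double mb, double mg,
                  double s, double x)
{
    double n = b + g;
    double bad, good, p;

    assert(mb > 0.0 && mg > 0.0);   /* sketch assumes non-empty totals */

    if (n == 0.0)
        return x;                   /* token never seen: f reduces to x */

    bad  = b / mb;
    good = g / mg;
    p = bad / (bad + good);

    return (s * x + n * p) / (s + n);
}
```

With made-up numbers: a token with b=5, g=5 from a list with mb=100,
mg=100 gives f = 0.5 (s=1, x=0.5). If mb is inflated to 1100 because a
skipped list's 1000 bad messages were added in, the same token gives
f = 0.1212..., which is the kind of skew described above.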
We _must_ eliminate the globals msgs_good/msgs_bad if we want
multiple_wordlist code to Do The Right Thing[tm], and pass individual
message good/bad counts depending on how many and which of the lists
we've looked at.
The integral steps are to
a- make sure that reading the tokens and the .MSG_COUNT happens in the
same transaction, so we aren't looking at bogus data when a large
registration hits us in the middle of scoring. (It was while checking
where to move the ds_txn_*() "brackets" that I saw that bogofilter goofs
up the calculations.)
b- sum up "local" msgs_good and msgs_bad in score.c that contain only
the counts of databases that were touched.
And possibly
c- think if we need to handle a "token not found" condition.
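A rough sketch of steps a- to c-, using an in-memory stand-in instead of
the real datastore. Everything here (struct wordlist, struct counts,
lookup_local()) is hypothetical and not bogofilter's actual types or API.
Which lists count as "touched" is also an assumption: I take every list at
the override level of the first list that has the token, and I assume the
list order is the priority order.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins, not bogofilter's real structures. */
struct tok { const char *name; unsigned long good, bad; };

struct wordlist {
    int override;                /* priority level of this list */
    unsigned long msgs_good;     /* this list's .MSG_COUNT (good) */
    unsigned long msgs_bad;      /* this list's .MSG_COUNT (bad) */
    const struct tok *toks;
    size_t ntoks;
    struct wordlist *next;
};

struct counts { unsigned long good, bad, msgs_good, msgs_bad; };

static const struct tok *find_tok(const struct wordlist *l, const char *name)
{
    size_t i;
    for (i = 0; i < l->ntoks; i++)
        if (strcmp(l->toks[i].name, name) == 0)
            return &l->toks[i];
    return NULL;
}

/* Returns 1 if some list has the token, 0 for "token not found" (c-).
 * In real code the transaction brackets would enclose this whole
 * function (or its caller), so token and .MSG_COUNT reads agree (a-). */
int lookup_local(const struct wordlist *lists, const char *name,
                 struct counts *out)
{
    const struct wordlist *l, *first = NULL;

    assert(out != NULL);
    memset(out, 0, sizeof *out);

    /* Find the first list, in priority order, that has the token. */
    for (l = lists; l != NULL; l = l->next)
        if (find_tok(l, name) != NULL) { first = l; break; }
    if (first == NULL)
        return 0;

    /* Sum token and message counts only over lists at that level (b-). */
    for (l = lists; l != NULL; l = l->next) {
        const struct tok *t;
        if (l->override != first->override)
            continue;
        t = find_tok(l, name);
        if (t != NULL) { out->good += t->good; out->bad += t->bad; }
        out->msgs_good += l->msgs_good;
        out->msgs_bad  += l->msgs_bad;
    }
    return 1;
}
```

The caller would then pass out->msgs_good and out->msgs_bad to the
probability calculation instead of the globals.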
I won't touch score.c before 09:00 UTC, so if you want to take care of
the "bogus msg_count", go ahead. The message_count tokens must either be
looked up alongside the token, or passed in from the caller. I think
the ds_txn_begin and _commit will be moved up the call stack, and then
the .MSG_COUNT per list would only have to be looked up once at the
beginning of a transaction, not once per other token that we query. I'll
handle getting the ds_txn stuff into the right places later.
--
Matthias Andree