multilist scoring & txn performance
David Relson
relson at osagesoftware.com
Wed Nov 17 03:26:31 CET 2004
On Wed, 17 Nov 2004 01:59:18 +0100
Matthias Andree wrote:
> David Relson <relson at osagesoftware.com> writes:
>
> > I've spotted something odd in function lookup() in file score.c. In
> > outline form, the code is:
> >
> > for (list=word_lists; list != NULL; list=list->next)
> > {
> >     ds_txn_begin(list->dsh);
> >     ret = ds_read(list->dsh, token, &val);
> >     ds_txn_commit(list->dsh);
> > }
>
> Tracing this, I found a _severe_ flaw in multiple-wordlist code that
> throws off our calculations when more than one wordlist is in use
> except when all have the same "override" value.
>
> We're using a priority queue and summing over all lists at the same
> "override" level as the first to have the token, no?
>
>
> Anyways, we're calculating (b/g = bad/good count, mb/mg = bad/good
> messages total; s/x are Robinson's values) - see prob.c:
>
>             b/mb
>   p := -----------
>        b/mb + g/mg
>
>
>        s*x + (b+g)*p
>   f := -------------
>          s + (b+g)
>
> What we're currently doing is summing up the individual list .MSG_COUNT
> values to get mb/mg. We are _NOT_ summing up the relevant .MSG_COUNT,
> but are using the sum of _all_ lists, so the token b/g values are
> bogus when we're skipping any of the lists.
>
> We _must_ eliminate the globals msgs_good/msgs_bad if we want
> multiple_wordlist code to Do The Right Thing[tm], and pass individual
> message good/bad counts depending on how many and which of the lists
> we've looked at.
>
> The integral steps are to:
>
> a- make sure that reading the tokens and the .MSG_COUNT happens in the
> same transaction so we aren't looking at bogus data when a large
> registration hits us in the middle of scoring - and checking where
> to move the ds_txn_() "brackets" I've seen that bogofilter goofs up
> the calculations.
>
> b- sum up "local" msgs_good and msgs_bad in score.c that contain only
> the counts of databases that were touched.
>
> And possibly
>
> c- think if we need to handle a "token not found" condition.
>
> I won't touch score.c before 09:00 UTC so if you want to take care of
> the "bogus msg_count", go ahead. The message_count tokens must either
> be looked up alongside the token, or passed in from the caller - I
> think the ds_txn_begin and _commit will be moved up the call stack,
> and the .MSG_COUNT per list would only have to be looked up at the
> beginning of a transaction, not once per other token that we query.
> I'll handle getting the ds_txn stuff into the right places later.
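For reference, here is a minimal sketch of the two formulas quoted above, to show how much the result depends on which mb/mg totals are passed in. The function name robinson_f() and the handling of zero totals are assumptions made for illustration; this is not the code in prob.c.

```c
/* Hypothetical helper, not bogofilter's prob.c.
 *   b, g   - bad/good counts for the token
 *   mb, mg - bad/good message totals for the lists actually consulted
 *   s, x   - Robinson's parameters (s is assumed to be > 0)
 */
static double robinson_f(double b, double g, double mb, double mg,
                         double s, double x)
{
    double n  = b + g;
    /* Assumption: a zero message total contributes a ratio of 0. */
    double pb = (mb > 0.0) ? b / mb : 0.0;
    double pg = (mg > 0.0) ? g / mg : 0.0;
    double p;

    /* No evidence at all: fall back to x, which is what f reduces to
     * when b+g is 0. */
    if (pb + pg == 0.0)
        return x;

    p = pb / (pb + pg);                 /* p := (b/mb) / (b/mb + g/mg) */
    return (s * x + n * p) / (s + n);   /* f := (s*x + (b+g)*p) / (s + (b+g)) */
}
```

With b=10, g=5 and arbitrary s=1, x=0.5, passing mb=mg=100 gives f = 0.65625, while inflating mb to 1100 by counting a list that was skipped drags f down to roughly 0.175.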
Matthias,
You're right about the global MSG_COUNT variable. Good detective work!
With the transaction moved up the call chain, presumably to bracket the
"foreach token" loop, MSG_COUNT can be read for each wordlist, then as
each token is looked up, MSG_COUNT values can be totaled for the
wordlists holding the word.
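A rough sketch of that idea, using in-memory stand-ins rather than the real datastore: the types entry_t, wordlist_t and counts_t and the functions find_entry() and lookup_token() are all hypothetical, and "override" priorities are ignored to keep it short. It assumes the caller has already opened the transaction and read each list's .MSG_COUNT before the "foreach token" loop starts.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins; these are NOT bogofilter's real types. */
typedef struct {
    const char   *token;
    unsigned long bad, good;
} entry_t;

typedef struct wordlist {
    struct wordlist *next;
    const entry_t   *entries;
    size_t           n_entries;
    unsigned long    msgs_bad, msgs_good;  /* this list's .MSG_COUNT, read once */
} wordlist_t;

typedef struct {
    unsigned long bad, good;            /* b, g over the lists holding the token */
    unsigned long msgs_bad, msgs_good;  /* mb, mg over those same lists only */
} counts_t;

/* Linear search standing in for a datastore read. */
static const entry_t *find_entry(const wordlist_t *list, const char *token)
{
    size_t i;
    for (i = 0; i < list->n_entries; i++)
        if (strcmp(list->entries[i].token, token) == 0)
            return &list->entries[i];
    return NULL;
}

/* Called once per token, inside the transaction held by the caller. */
static counts_t lookup_token(const wordlist_t *lists, const char *token)
{
    counts_t c = { 0, 0, 0, 0 };
    const wordlist_t *l;

    for (l = lists; l != NULL; l = l->next) {
        const entry_t *e = find_entry(l, token);
        if (e == NULL)
            continue;                   /* list skipped: its totals stay out */
        c.bad       += e->bad;
        c.good      += e->good;
        c.msgs_bad  += l->msgs_bad;     /* total only the relevant .MSG_COUNT */
        c.msgs_good += l->msgs_good;
    }
    return c;
}
```

The returned counts replace the msgs_good/msgs_bad globals for that token; a token found in no list comes back with all four fields zero, which is the "token not found" case still to be decided.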
I don't anticipate having time this evening, so I'll let you take on
both tasks.
Ciao,
David