multilist scoring & txn performance

David Relson relson at osagesoftware.com
Wed Nov 17 03:26:31 CET 2004


On Wed, 17 Nov 2004 01:59:18 +0100
Matthias Andree wrote:

> David Relson <relson at osagesoftware.com> writes:
> 
> > I've spotted something odd in function lookup() in file score.c.  In
> > outline form, the code is:
> >
> >     for (list = word_lists; list != NULL; list = list->next)
> >     {
> >         ds_txn_begin(list->dsh);
> >         ret = ds_read(list->dsh, token, &val);
> >         ds_txn_commit(list->dsh);
> >     }
> 
> Tracing this, I found a _severe_ flaw in the multiple-wordlist code
> that throws off our calculations when more than one wordlist is in
> use, unless all lists have the same "override" value.
> 
> We're using a priority queue and summing over all lists at the same
> "override" level as the first to have the token, no?
> 
> 
> Anyways, we're calculating (b/g = bad/good count, mb/mg = bad/good
> messages total; s/x are Robinson's values) - see prob.c:
> 
>       b/mb
> p :=  -----------
>       b/mb + g/mg
> 
> 
>       s*x + (b+g)*p
> f :=  -------------
>       s   + (b+g)
> 
> What we're currently doing is summing the .MSG_COUNT values of every
> list to get mb/mg. We are _NOT_ summing only the relevant .MSG_COUNT
> values, but using the sum over _all_ lists, so the token's b/mb and
> g/mg ratios are bogus whenever we skip any of the lists.
> 
> We _must_ eliminate the globals msgs_good/msgs_bad if we want
> multiple_wordlist code to Do The Right Thing[tm], and pass individual
> message good/bad counts depending on how many and which of the lists
> we've looked at.
> 
> The integral steps are to
> 
> a- make sure that reading the tokens and the .MSG_COUNT happens in the
>    same transaction, so we aren't looking at bogus data when a large
>    registration hits us in the middle of scoring. (It was while
>    checking where to move the ds_txn_() "brackets" that I saw
>    bogofilter goof up the calculations.)
> 
> b- sum up "local" msgs_good and msgs_bad in score.c that contain only
>    the counts of databases that were touched.
> 
> And possibly
> 
> c- consider whether we need to handle a "token not found" condition.
> 
> I won't touch score.c before 09:00 UTC so if you want to take care of
> the "bogus msg_count", go ahead. The message_count tokens must either
> be looked up alongside the token, or passed in from the caller - I
> think the ds_txn_begin and _commit will be moved up the call stack,
> and the .MSG_COUNT per list would only have to be looked up at the
> beginning of a transaction, not once per other token that we query.
> I'll handle getting the ds_txn stuff into the right places later.
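
For reference, the two formulas quoted above can be sketched in C as
follows. This is only an illustration, not the code in prob.c: the
function name, the handling of zero totals, and the sample values for s
and x used further down are all assumptions.

```c
#include <assert.h>

/*
 * Hypothetical helper, not the code in prob.c.
 *   b, g   : bad/good counts for the token
 *   mb, mg : bad/good message totals the counts are weighed against
 *   s, x   : Robinson's parameters
 */
static double robinson_f_sketch(double b, double g, double mb, double mg,
                                double s, double x)
{
    double n = b + g;
    double pb, pg, p;

    assert(s > 0.0);                 /* keeps the divisor s + n nonzero */

    /* Assumption: an empty corpus contributes a ratio of 0. */
    pb = (mb > 0.0) ? b / mb : 0.0;
    pg = (mg > 0.0) ? g / mg : 0.0;

    /* Assumption: with no evidence at all, fall back to x. */
    if (pb + pg == 0.0)
        return x;

    p = pb / (pb + pg);                /* p := (b/mb) / (b/mb + g/mg)   */
    return (s * x + n * p) / (s + n);  /* f := (s*x + n*p) / (s + n)    */
}
```

With b = g = 10 and mb = mg = 100 (and, purely for illustration, s = 1
and x = 0.5), this gives f = 0.5. If mb is inflated to 1100 because a
skipped list's 1000 spam messages were added in, the same token drops
to about 0.10, which is the kind of skew described above.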

Matthias,

You're right about the global MSG_COUNT variable.  Good detective work!

With the transaction moved up the call chain, presumably to bracket the
"foreach token" loop, MSG_COUNT can be read for each wordlist, then as
each token is looked up, MSG_COUNT values can be totaled for the
wordlists holding the word.
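
A minimal sketch of that arrangement, using in-memory stand-ins rather
than the real datastore: every sk_* type and function below is
hypothetical and only mimics the roles of ds_txn_begin, ds_read and
ds_txn_commit. It also ignores "override" levels, so it shows the
per-token totaling only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins; none of these names exist in bogofilter. */
typedef struct { const char *token; unsigned long bad, good; } sk_entry;

typedef struct sk_list {
    const sk_entry *entries;             /* the list's token records      */
    size_t          n_entries;
    unsigned long   msgs_bad, msgs_good; /* what .MSG_COUNT would hold    */
    int             in_txn;              /* 1 while a "transaction" is open */
    struct sk_list *next;
} sk_list;

typedef struct { unsigned long bad, good, msgs_bad, msgs_good; } sk_totals;

static void sk_txn_begin(sk_list *l)  { l->in_txn = 1; }
static void sk_txn_commit(sk_list *l) { l->in_txn = 0; }

/* Returns 0 and fills *out if the token is in the list, 1 otherwise. */
static int sk_read(const sk_list *l, const char *token, sk_entry *out)
{
    for (size_t i = 0; i < l->n_entries; i++) {
        if (strcmp(l->entries[i].token, token) == 0) {
            *out = l->entries[i];
            return 0;
        }
    }
    return 1;
}

/* Total the counts and MSG_COUNT values of the lists holding the token. */
static sk_totals sk_lookup(const sk_list *lists, const char *token)
{
    sk_totals t = { 0, 0, 0, 0 };

    for (const sk_list *l = lists; l != NULL; l = l->next) {
        sk_entry e;

        assert(l->in_txn);       /* reads must happen inside the bracket */
        if (sk_read(l, token, &e) != 0)
            continue;            /* list not touched: skip its MSG_COUNT */
        t.bad       += e.bad;
        t.good      += e.good;
        t.msgs_bad  += l->msgs_bad;
        t.msgs_good += l->msgs_good;
    }
    return t;
}

/* One begin/commit pair per list brackets the whole "foreach token" loop. */
static void sk_score_tokens(sk_list *lists, const char *const *tokens,
                            size_t n_tokens, sk_totals *out)
{
    for (sk_list *l = lists; l != NULL; l = l->next)
        sk_txn_begin(l);
    for (size_t i = 0; i < n_tokens; i++)
        out[i] = sk_lookup(lists, tokens[i]);
    for (sk_list *l = lists; l != NULL; l = l->next)
        sk_txn_commit(l);
}
```

In the real code the MSG_COUNT record would presumably be read from
each list once, right after its transaction begins; here it is simply a
field of the stand-in struct.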

I don't expect to have time this evening, so I'll let you take on both
tasks.

Ciao,

David


