multilist scoring & txn performance

Matthias Andree matthias.andree at gmx.de
Wed Nov 17 17:59:37 CET 2004


David Relson <relson at osagesoftware.com> writes:

>> a- make sure that reading the tokens and the .MSG_COUNT happens in the
>>    same transaction, so we aren't looking at bogus data when a large
>>    registration hits us in the middle of scoring - while checking
>>    where to move the ds_txn_() "brackets", I've seen that bogofilter
>>    goofs up the calculations.
>> 
>> b- sum up "local" msgs_good and msgs_bad in score.c so that they
>>    contain only the counts of the databases that were touched.

> You're right about the global MSG_COUNT variable.  Good detective
> work!

Ok, b is fixed.
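
Something along these lines is what (b) amounts to (the names are
invented, this is not the actual score.c code) - only the wordlists
that actually contributed a token feed their .MSG_COUNT values into
the local totals:

/* sketch only: the field names are made up, not the real types */
struct wl {
    struct wl    *next;
    unsigned long msgcount_good;   /* ham  messages from .MSG_COUNT */
    unsigned long msgcount_bad;    /* spam messages from .MSG_COUNT */
    int           touched;         /* set when a token was found here */
};

static void sum_local_counts(const struct wl *head,
                             unsigned long *good, unsigned long *bad)
{
    const struct wl *w;

    *good = *bad = 0;
    for (w = head; w != NULL; w = w->next) {
        if (!w->touched)
            continue;              /* skip lists no token came from */
        *good += w->msgcount_good;
        *bad  += w->msgcount_bad;
    }
}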

> With the transaction moved up the call chain, presumably to bracket the
> "foreach token" loop, MSG_COUNT can be read for each wordlist; then, as
> each token is looked up, MSG_COUNT values can be totaled for the
> wordlists holding the word.

The transaction nesting issue is still open and will require another
round of major changes to the code, similar to those already made.

The problem is that transactions - with Berkeley DB - are local to the
_environment_ (not to the database!) and can thus span multiple
databases in the same environment. This is good in general, although it
is of no advantage to our current bogofilter model.
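
To illustrate (a trimmed sketch against the Berkeley DB 4.x C API, not
bogofilter code - dbp_ham and dbp_spam stand for two DB handles that
were opened in the same environment, and error handling is minimal):

#include <db.h>
#include <string.h>

int read_msgcounts(DB_ENV *dbenv, DB *dbp_ham, DB *dbp_spam)
{
    DB_TXN *txn;
    DBT key, val;
    int ret;

    /* one transaction, begun at the environment level ...            */
    ret = dbenv->txn_begin(dbenv, NULL, &txn, 0);
    if (ret != 0)
        return ret;

    memset(&key, 0, sizeof(key));
    memset(&val, 0, sizeof(val));
    key.data = ".MSG_COUNT";
    key.size = strlen(".MSG_COUNT");

    /* ... brackets the reads from two different databases, so a large
     * registration committing in between cannot skew the counts      */
    ret = dbp_ham->get(dbp_ham, txn, &key, &val, 0);
    if (ret == 0)
        ret = dbp_spam->get(dbp_spam, txn, &key, &val, 0);

    if (ret == 0)
        return txn->commit(txn, 0);

    txn->abort(txn);
    return ret;
}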

As a simplification, we can, I believe, safely map "Berkeley DB
environment" to "directory" 1:1 for our purposes - there can only be
one environment in a given directory.

I've figured out that QDBM handles transactions differently: its
transaction support is just a rollback-after-application-crash feature.

I plan to expose the environment-layer functions somewhat, build a list
of environments as we initialize the wordlists (deriving them from the
"dirname" part of the _absolute_ database filename), and let each
wordlist know which environment it is in.
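
Roughly along these lines (the names are invented for illustration,
this is not a patch):

typedef struct dbenv_list {
    struct dbenv_list *next;
    char              *directory;  /* dirname of the absolute db path */
    void              *handle;     /* DB_ENV * in the Berkeley DB case */
} dbenv_list_t;

/* look up the environment for a directory, creating it on first use */
dbenv_list_t *dbenv_get(dbenv_list_t **head, const char *directory);

Each wordlist would then carry a pointer into that list, i.e. a new
field telling it which environment its database lives in.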

I'm not yet sure how to distribute the responsibilities for opening the
environments, opening the wordlists and so on.

On a related note, I've seen some garbled code around bogotune that I
find ugly and that makes the code hard to maintain.  It appears to be a
poor attempt to map the concept of function overloading onto C, and
some types seem to be heavily abused and cast to incompatible types.

We're also abusing enum types as bitfields, run_t for instance. When we
mean a bitfield, we should write a bitfield, e.g.

struct run_flags {
       bool      register_spam:   1;
       bool      register_ham:    1;
       bool      unregister_spam: 1;
       bool      unregister_ham:  1;
       bool      normal:          1;
       /* ... */
};

or similar.
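
Hypothetical usage (assuming <stdbool.h> and the flag names above) -
the point is that each flag is set and tested by name, instead of
OR-ing enum constants into a single run_t value:

struct run_flags run = { 0 };

run.register_spam  = true;
run.unregister_ham = true;   /* e.g. correcting a wrongly filed message */

if (run.register_spam && run.unregister_ham) {
    /* correction run: move counts from the ham list to the spam list */
}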

We also have some code around that is pretty low-level, like the
init_wordlist stuff that has to implement a priority queue by hand,
deal with allocation issues, and so on.
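
Schematically, that is this kind of thing (not the actual init_wordlist
code, just the pattern I mean) - keeping the chain ordered by hand with
pointer surgery instead of using a ready-made container:

struct wl_node {
    struct wl_node *next;
    int             override;   /* lower value = higher priority */
};

static void wl_insert_sorted(struct wl_node **head, struct wl_node *node)
{
    struct wl_node **pp = head;

    /* walk the chain until we find the insertion point */
    while (*pp != NULL && (*pp)->override <= node->override)
        pp = &(*pp)->next;
    node->next = *pp;
    *pp = node;
}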

We should consider using a higher-level language than C; C++ with the
STL comes to mind, and it rids us of several crutches, for instance all
the word_* blech - we get string, vector, queue, map and so on.  GCC
works, there is STLport, and after all, we aren't forced to support
systems that don't have a decent C++ implementation.

I'm suggesting C++ because it's widely ported and allows a gradual
migration.

-- 
Matthias Andree


