multiple wordlists vs. BerkeleyDB environments

David Relson relson at osagesoftware.com
Mon Nov 15 02:10:12 CET 2004


On Mon, 15 Nov 2004 01:33:37 +0100
Matthias Andree wrote:

> Ann Arbor, we've got a problem!
> 
> Apart from the open RFC-2047 issue, the multiple wordlist isn't
> working at this time except for multiple databases in the same
> directory and within the same environment (= __db.* and log.* files) -
> and it cannot be made to work without tipping the whole
> wordlists/datastore scheme as it stands now if we want to offer
> system-wide databases.
> 
> The "same directory" is an implementation artifact and not a
> limitation of the system - limitations are:
> 
> 1 - for a database to be shared *in a consistent way* between
>     applications, the applications must belong to and share *the same*
>     environment.
> 
> 2 - for consistency (i.e. not see bogus data in a reader while a
>     writer is in progress), we need transactions (concurrent datastore
>     will not suffice for lack of atomicity), unless we want to mutually
>     lock readers and writers potentially for extended periods of time,
>     so we cannot do without environment
> 
> 3 - this poses the new interesting question: access control.
>     system-wide databases would need to be writable for anyone - at
>     least the environment, and users could wreak havoc at will. Not
>     exactly what we want.
> 
> I see some ways out:
> 
> A - forget about shared wordlists, fix the "same directory" bug
>     and move on; applications with access to the same environment
>     trust each other implicitly.
> 
> B - full server-client model with access control or read-only access
>     that cares for the consistency, perhaps delivering ready-made
>     tokens. May need to be multithreaded, which opens a new can of
>     worms labeled "POSIX threads" unless we want to use a fork() model
>     which may imply awkward performance.
> 
> C - other suggestions?
> 
> B sounds too large for inclusion into 1.0, which must become ready
> some day. A should be quick to implement though.

Matthias,

From the beginnings of a gdb session, it appears that environment
initialization, i.e. function db_xinit, is using the value of bogohome
and is ignoring the wordlist directives.  That's not what we want.
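
To make the idea concrete, here is a minimal sketch of deriving an
environment home from each wordlist's own path instead of from bogohome.
The helper name env_home_for_wordlist is hypothetical and is not taken
from bogofilter's sources; it also assumes that one environment lives in
the directory that holds the wordlist file, and that paths use '/' as
the separator.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not bogofilter's actual code): copy the directory
 * part of a wordlist path into buf, so that it could be used as the
 * environment home for that list.
 * Returns 0 on success, -1 if buf is too small. */
static int env_home_for_wordlist(const char *wordlist_path,
                                 char *buf, size_t buflen)
{
    const char *slash;
    size_t len;

    assert(wordlist_path != NULL && buf != NULL);

    slash = strrchr(wordlist_path, '/');
    if (slash == NULL) {
        /* bare file name: the list lives in the current directory */
        wordlist_path = ".";
        len = 1;
    } else if (slash == wordlist_path) {
        /* file directly under the root directory */
        len = 1;
    } else {
        len = (size_t)(slash - wordlist_path);
    }

    if (len + 1 > buflen)
        return -1;

    memcpy(buf, wordlist_path, len);
    buf[len] = '\0';
    return 0;
}
```

With something like this, a system list and a user list in different
directories would each yield their own environment home, which is what
the wordlist directives ask for.  Whether db_xinit can actually cope
with more than one environment is a separate question.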

Not supporting shared lists, for example system and user lists, is also
not what we want.

Our configuration up through 0.92.8 supported multiple wordlists
(located wherever) and didn't support transactions.  Our configuration
since 0.93.0 supports transactions and multiple lists (but only if
they're in the same environment).  The environment limitation is not
good.  How hard is it to support multiple environments?  I'm not in
favor of access control or anything similar -- it's much too
complicated.

Bogofilter's purpose is scoring messages as spam or ham.  The bulk of
its code and its complexity should be oriented towards that purpose.  I
fear that bogofilter is becoming more and more involved in supporting
Berkeley DB's advanced capabilities.  Supporting transactions is
complicating support for every user and I fear it's off purpose for
bogofilter.  Remember, bogofilter is a spam filter, not a showcase for
Berkeley DB.

Function db_xinit needs to use the proper path (database environment),
does it not?  Rather than lose multiple wordlist support, I'd prefer to
offer a configuration option for:

A - multiple wordlists, no transactions
B - transactions, single database environment
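
As a rough sketch of how such an option could be checked at startup
(all names here are hypothetical, and it assumes that "same
environment" can be approximated by "same directory", with paths
compared literally and not normalized):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length of the directory prefix of path, including the trailing '/'
 * (0 if the path contains no '/'). */
static size_t dir_prefix_len(const char *path)
{
    const char *slash = strrchr(path, '/');
    return slash ? (size_t)(slash - path) + 1 : 0;
}

/* Hypothetical check: returns 1 if the configuration fits one of the
 * two proposed modes, 0 otherwise.
 *   mode A: no transactions, wordlists may live anywhere
 *   mode B: transactions, all wordlists in one directory */
static int config_is_supported(int use_transactions,
                               const char *const *wordlists, size_t count)
{
    size_t i, first_len;

    assert(wordlists != NULL || count == 0);

    if (!use_transactions || count < 2)
        return 1;   /* mode A, or trivially a single environment */

    first_len = dir_prefix_len(wordlists[0]);
    for (i = 1; i < count; i++) {
        if (dir_prefix_len(wordlists[i]) != first_len ||
            strncmp(wordlists[i], wordlists[0], first_len) != 0)
            return 0;   /* transactions requested across directories */
    }
    return 1;
}
```

A configuration that asks for transactions with lists in two different
directories would be rejected with a clear message, instead of quietly
opening everything under bogohome.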

What do you think?

David



More information about the bogofilter-dev mailing list