multiple wordlists vs. BerkeleyDB environments

Matthias Andree matthias.andree at gmx.de
Mon Nov 15 04:00:55 CET 2004


David Relson <relson at osagesoftware.com> writes:

> From the beginnings of a gdb session, it appears that environment
> initialization, i.e. function db_xinit, is using the value of bogohome
> and is ignoring the wordlist directives.  That's not what we want.

The stripping of the directory happens in db_open: we need to strip the
bogohome directory from the filename, because Berkeley DB considers all
paths relative to the database home directory (we use bogohome for
that).
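
As a minimal sketch of that stripping step (strip_home is a hypothetical
helper written for illustration, not the code in db_open), the relative
name can be computed as below. A wordlist that does not live under the
home directory yields NULL, which is exactly the case the current
single-environment setup cannot handle:

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not bogofilter's actual code: return the part of
 * `path` that is relative to `home`, or NULL if `path` is not below `home`.
 * The result points into `path`; nothing is allocated.
 */
const char *strip_home(const char *home, const char *path)
{
    size_t n = strlen(home);

    /* Ignore trailing slashes on home so "/a/b" and "/a/b/" behave alike. */
    while (n > 0 && home[n - 1] == '/')
        n--;

    if (strncmp(home, path, n) != 0)
        return NULL;            /* different prefix */

    if (path[n] != '/')
        return NULL;            /* e.g. home "/a/b" vs. path "/a/bc/x" */

    /* Skip the separator(s) between home and the relative part. */
    path += n;
    while (*path == '/')
        path++;

    return *path != '\0' ? path : NULL;
}
```

With home "/home/u/.bogofilter", the path
"/home/u/.bogofilter/wordlist.db" becomes "wordlist.db", while
"/var/lib/bogofilter/wordlist.db" is rejected.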

> Not supporting shared lists, for example system and user lists, is also
> not what we want.
>
> Our configuration up through 0.92.8 supported multiple wordlists
> (located wherever) and didn't support transactions.

Right. It required that the system database be updated only when
there was no reader, and it wasn't crash-proof.

> Our configuration since 0.93.0 supports transactions and multiple
> lists (but only if they're in the same environment).  The environment
> limitation is not good.  How hard is it to support multiple
> environments?

It's easy enough (let db_init return a handle and move more global data
into handle-local data), but it doesn't take us anywhere.
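
To illustrate what "return a handle" could look like, here is a rough
sketch. The type dbe_t, its fields, and the functions dbe_init and
dbe_cleanup are placeholder names assumed for this example; they are not
taken from the bogofilter source, and no Berkeley DB calls are shown:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-environment handle; field names are illustrative only. */
typedef struct dbe {
    char *home;        /* environment home directory (bogohome today) */
    int   open_count;  /* wordlists currently opened in this environment */
} dbe_t;

/* Return a new handle for `home`, or NULL on allocation failure. */
dbe_t *dbe_init(const char *home)
{
    dbe_t *env = malloc(sizeof *env);
    if (env == NULL)
        return NULL;

    size_t len = strlen(home) + 1;
    env->home = malloc(len);
    if (env->home == NULL) {
        free(env);
        return NULL;
    }
    memcpy(env->home, home, len);   /* private copy, no shared global */
    env->open_count = 0;
    return env;
}

/* Release a handle obtained from dbe_init(); NULL is accepted. */
void dbe_cleanup(dbe_t *env)
{
    if (env == NULL)
        return;
    free(env->home);
    free(env);
}
```

Two calls with different directories then give two independent handles,
one per environment, instead of one set of globals tied to bogohome.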

The problem is the trust that sharing Berkeley DB environments
implies. You must make the environment writable, and that won't work for
"system" databases.

> I'm not in favor of access control or anything similar -- it's much
> too complicated.

We cannot offer access control as long as we give direct access to the
database.

> Bogofilter's purpose is scoring messages as spam and ham.  The bulk of
> its code and its complexity should be oriented towards that purpose.  I
> fear that bogofilter is becoming more and more involved in supporting
> Berkeley DB's advanced capabilities.  Supporting transactions is
> complicating support for every user and I fear it's off purpose for
> bogofilter.  Remember, bogofilter is a spam filter, not a show case for
> Berkeley DB.

What is implemented is the bare necessity for a robust database, not an
iota more. We have 19,500 lines of code in src/; datastore_db.c and
db_lock together are less than 8% of that.

> Function db_xinit needs to use the proper path (database environment),
> does it not?  Rather than lose multiple wordlist support, I'd rather
> offer a configuration option for:
>
> A - multiple wordlists, no transactions
> B - transactions, single database environment
>
> What do you think?

I don't like A. Centralized databases in particular must work at all
times. Imagine the database crashing on Friday at 5:01 pm... users who
need mail over the weekend will have to remove the central database
from their bogofilter setup. Transactions are what enables "system"
wordlists in the long run, provided we get the access control (libwrap.a
might work, or user ID checking on UNIX-domain sockets) and isolation
right.

B - easy enough, will look into that this week. This may involve API
    cleanup changes. Some issues are that we keep too much data global
    that should be passed in instead, and that we cannot run recovery on
    individual databases (QDBM wants that), only on the environment. For
    that reason, db_recover is a misnomer: it should be dbe_recover, and
    it must take the bogohome directory as an argument.

The other problem is that bogofilter works best for the one user who
trained it, and is inferior with site-wide files.

My suggestion:

C - drop wordlist sharing when user IDs are different,
    fix wordlist sharing for single environment as above,

    implement a multi-threaded client-server model for bogofilter 1.1,
    which would then allow distributed and centralized/site-wide databases.
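
For the first part of C, one possible way to decide whether a wordlist
may be shared is to compare the file's owner with the effective user ID
of the process. The function name below is made up for this sketch, and
a real implementation would probably check an already opened file
descriptor with fstat() rather than a path, to avoid races:

```c
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical check, not existing bogofilter code: true if `path` exists
 * and is owned by the effective user ID of the calling process.
 */
bool wordlist_owned_by_caller(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return false;           /* missing or inaccessible: do not share */

    return st.st_uid == geteuid();
}
```

A wordlist for which this returns false would simply be left out of the
shared environment under proposal C.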

-- 
Matthias Andree


