simplicity vs safety with complexity

David Relson relson at osagesoftware.com
Tue Jan 25 01:49:00 CET 2005


Greetings,

I've got a question for y'all.  Would you rather have:

1) a wordlist that's simple, easy to back up, but vulnerable to software
and hardware crashes; or

2) a wordlist that offers crash protection but is complex to maintain,
back up, and so on?

Choice 1 refers to Berkeley DB without transactions (as found in
bogofilter through 0.92.8) and choice 2 refers to Berkeley DB _with_
transactions (0.93.0 and later).

Now that #2 is working well and people have experience with both
transactional and non-transactional versions, an informed choice can be
made.  This choice will affect the default mode for bogofilter 1.0 and
beyond. 

Please respond to this message and let the developers know your
preference!

Here's some background info:

Bogofilter's history as a spam filter is one of success and continued
improvement.  This is good!

Bogofilter's need for storing tokens has led to a series of database
implementations.  Each step in the series has provided a solution. Each
step has caused new problems.  Each step has required new solutions...

To recap the history briefly, bogofilter used the Judy database package
for its earliest versions.  Beginning with version 0.7.5 in October
2002, bogofilter switched to Berkeley DB for improved performance.

Over time it became apparent that there were some issues with Berkeley
DB of which two were significant.  With the locking available, any
number of programs could read the database, but writing it required a
single program to get exclusive read/write access.  Additionally, if a
program had the database open (for writing) and there was a program or
system crash, the database could become corrupt.
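
To make the "many readers, one writer" rule concrete, here is a minimal
sketch using POSIX advisory locks through Python's fcntl.flock.  This is
an assumption for illustration only: it is not necessarily the locking
mechanism bogofilter used, and the temporary file is just a stand-in for
wordlist.db.

```python
import fcntl
import os
import tempfile


def try_lock(fd, exclusive):
    """Try to take a lock without blocking; return True on success."""
    mode = fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH
    try:
        fcntl.flock(fd, mode | fcntl.LOCK_NB)
        return True
    except OSError:
        return False


def demo():
    # Stand-in for wordlist.db: a temporary file, not a real database.
    handle, path = tempfile.mkstemp()
    os.close(handle)
    reader1 = os.open(path, os.O_RDONLY)
    reader2 = os.open(path, os.O_RDONLY)
    writer = os.open(path, os.O_RDWR)
    try:
        results = {
            # Any number of readers may hold a shared lock at once.
            "reader1": try_lock(reader1, exclusive=False),
            "reader2": try_lock(reader2, exclusive=False),
            # A writer needs exclusive access, so this attempt is refused.
            "writer_while_reading": try_lock(writer, exclusive=True),
        }
        fcntl.flock(reader1, fcntl.LOCK_UN)
        fcntl.flock(reader2, fcntl.LOCK_UN)
        # Once the readers are gone, the writer can get in.
        results["writer_after_readers"] = try_lock(writer, exclusive=True)
        return results
    finally:
        for fd in (reader1, reader2, writer):
            os.close(fd)
        os.remove(path)
```

Note that a lock of this kind only coordinates access; it does nothing
to protect the file if the writer dies halfway through an update.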

In many cases a corrupt database went unnoticed: bogofilter continued
to run and the problems showed up only later.  They were often
spotted when dumping the database, and recovery often involved
rebuilding the database from raw mail folders.

Over time, people learned to deal with corruption problems by backing up
the database -- either by copying the file (wordlist.db) to another file
(or directory) or by dumping the file using bogoutil.

With the 0.93.0 release in November 2004, bogofilter started using
Berkeley DB's transaction capability, which offers significant
advantages.  The use of journalling log files ensures the database can
automatically be restored to a consistent state after program and system
crashes.  The use of lock tables allows fine-grained locking of the
database, which permits simultaneous reading and writing of it.  Along
with these benefits came some new problems -- added complexity and
storage requirements as described below.
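
For readers who haven't met journalling before, the idea can be shown
with a toy write-ahead log.  This is a sketch of the general technique
only, not of Berkeley DB's log format or recovery code; the record
layout, the file name, and the functions append_txn and recover are all
invented for the example.

```python
import json
import os


def append_txn(log_path, updates, commit=True):
    """Append one transaction to the log.

    commit=False simulates a crash before the transaction finished.
    """
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps({"op": "begin"}) + "\n")
        for token, count in updates.items():
            record = {"op": "put", "token": token, "count": count}
            log.write(json.dumps(record) + "\n")
        if commit:
            log.write(json.dumps({"op": "commit"}) + "\n")
        log.flush()
        os.fsync(log.fileno())  # force the records to disk


def recover(log_path):
    """Rebuild the token table by replaying committed transactions only."""
    table = {}
    pending = {}
    if not os.path.exists(log_path):
        return table
    with open(log_path, encoding="utf-8") as log:
        for line in log:
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                break  # torn final record: stop replaying here
            if record["op"] == "begin":
                pending = {}  # drop any unfinished earlier transaction
            elif record["op"] == "put":
                pending[record["token"]] = record["count"]
            elif record["op"] == "commit":
                table.update(pending)
                pending = {}
    return table
```

A transaction that never reached its commit record is simply ignored on
replay, which is what leaves the table in a consistent state after a
crash.  The price is the log itself, which keeps growing until something
trims it.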

First, the data storage changed from a single file to a database
environment (directory) containing multiple files.  The environment
contains at least 9 files -- wordlist.db, 2 lockfiles (lockfile-d and
lockfile-p), 5 files named __db.001 to __db.005, and 1 or more logfiles
named log.NNNNNNNNNN.
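
As a rough illustration, the file names listed above can be sorted into
groups with a few patterns.  The function classify and the group labels
are made up for this sketch ("region" is the usual Berkeley DB term for
the __db.NNN files), and the patterns assume exactly the naming shown
above.

```python
import re

# Patterns follow the file names given in the text above.
PATTERNS = {
    "wordlist": re.compile(r"^wordlist\.db$"),
    "lockfile": re.compile(r"^lockfile-[dp]$"),
    "region": re.compile(r"^__db\.\d{3}$"),
    "log": re.compile(r"^log\.\d{10}$"),
}


def classify(names):
    """Count how many file names fall into each group."""
    counts = {kind: 0 for kind in PATTERNS}
    counts["other"] = 0
    for name in names:
        for kind, pattern in PATTERNS.items():
            if pattern.match(name):
                counts[kind] += 1
                break
        else:
            counts["other"] += 1
    return counts
```

Calling classify(os.listdir(path)) on a bogofilter directory (after
importing os) gives a quick picture of what is in the environment and how
many logfiles have piled up.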

Second, working with the new data environment requires understanding it.
To aid in that, substantial new documentation was added.  The primary
document is the 10-page README.db file.  As further aids, helper scripts
(bf_copy, bf_compact, bf_tar, etc.) are included with bogofilter.

Third, as messages are added to the database, the logfiles grow rapidly.
This causes problems for some users.  Dealing with the logfiles requires
learning additional Berkeley DB commands such as db_checkpoint and
db_archive.
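
A typical cleanup is a checkpoint followed by removal of the logfiles
that are no longer needed.  The sketch below wraps the two commands in a
small Python helper; the helper names and the dry_run default are
invented for the example, and some distributions install the utilities
under versioned names, so adjust accordingly.  README.db remains the
authoritative guide.

```python
import subprocess


def log_cleanup_commands(env_dir):
    """Return the two command lines for trimming logfiles in env_dir."""
    return [
        # -1: write a single checkpoint and exit; -h: environment directory
        ["db_checkpoint", "-1", "-h", env_dir],
        # -d: delete logfiles that are no longer needed
        ["db_archive", "-d", "-h", env_dir],
    ]


def clean_logs(env_dir, dry_run=True):
    """Run the cleanup commands, or only report them when dry_run is set."""
    commands = log_cleanup_commands(env_dir)
    if not dry_run:
        for argv in commands:
            subprocess.run(argv, check=True)
    return commands
```

Keep in mind that logfiles removed this way are no longer available for
rebuilding the database from an older backup, so copy them elsewhere
first if you rely on that.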

Fourth, as the database grows, the default lock table size becomes
inadequate.  Unfamiliar error messages appear and cause confusion.
Correcting the problem requires adding a DB_CONFIG file to the
database environment.
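
As an illustration, a DB_CONFIG that enlarges the lock table could be
generated like this.  The directives set_lk_max_locks and
set_lk_max_objects are standard Berkeley DB settings, but the value
32768 is only a placeholder and write_db_config is a helper invented for
the example; see README.db for values suited to your wordlist.

```python
import os


def write_db_config(env_dir, max_locks=32768, max_objects=32768):
    """Write a DB_CONFIG file into env_dir and return its path.

    The default values are placeholders, not recommendations.
    """
    lines = [
        f"set_lk_max_locks {max_locks}",
        f"set_lk_max_objects {max_objects}",
    ]
    path = os.path.join(env_dir, "DB_CONFIG")
    with open(path, "w", encoding="utf-8") as config:
        config.write("\n".join(lines) + "\n")
    return path
```

The resulting file is two lines of plain text, so writing it by hand in
an editor works just as well.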

To summarize several of the above items, the benefits of database
reliability through transactions come at the cost of increased
complexity and the need for substantial new learning to deal with the
new environment.

Effective use of bogofilter with transactions is forcing mail admins
into an additional role as database admins, which is more than some
wish to deal with.  One option is to disable transactions by default
and to offer enabling them as the solution when database problems are
encountered.  This would allow easy initial use of bogofilter and put
off climbing the Berkeley DB learning curve until it's necessary.
Briefly put, "old style for newbies, txn for the pros".

At this point, we need to look forward to bogofilter 1.0 and decide
which is the preferable default -- old style databases or new style
databases.  In either case, configure's --disable-transactions and
--enable-transactions options allow a mail admin to select his/her modus
operandi.

Regards,

David



