Transactional Code and Disk Usage

Matthias Andree matthias.andree at gmx.de
Sat Sep 4 01:41:19 CEST 2004


On Fri, 03 Sep 2004, David Relson wrote:

> bogofilter-0.92.6+txn2.1 is a distinct improvement over ...txn2.0.  In
> my BerkeleyDB (4.1.25) environment, "make check" now passes all 37
> regression tests and my private test is also successful.  Bravo!!

:-)

> Now for some questions...  
> 
> 'Tis my understanding that the __db.00x files are permanent additions
> needed to support the transactional code.  It appears that they need
> about 6M per wordlist, correct?

That depends on the exact size. One file is the environment info, one is
the memory pool ("cache"), one is the log info (the actual log is in the
log.* files, obviously), one is the lock region, and finally there is the
transaction region. The memory pool is large and is a tuning factor, and
the lock region will have to grow (via DB_CONFIG) as the database
grows if we want bogoutil -d to work.
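
To illustrate the DB_CONFIG tuning, here is a minimal sketch that writes
such a file into the bogofilter directory. set_cachesize and the
set_lk_max_* lines are standard Berkeley DB DB_CONFIG directives, but
every number below is a placeholder rather than a recommendation, and
the helper name write_db_config is made up for this example.

```sh
# Hypothetical helper: write a DB_CONFIG file into a Berkeley DB
# environment directory (default: ~/.bogofilter).
# All numbers are placeholders, to be tuned for the actual wordlist.
write_db_config() {
    dir=${1:-"$HOME/.bogofilter"}
    # set_cachesize takes <gigabytes> <bytes> <number of regions>;
    # here: 0 GB + 8 MB in a single region (the memory pool).
    # The set_lk_max_* values size the lock region.
    cat > "$dir/DB_CONFIG" <<'EOF'
set_cachesize 0 8388608 1
set_lk_max_locks 32768
set_lk_max_objects 32768
set_lk_max_lockers 32768
EOF
}
```

Usage would be "write_db_config ~/.bogofilter". As far as I know the
region sizes are fixed when the environment is created, so a changed
DB_CONFIG should only take effect once the __db.* files are recreated.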

db_stat -h .bogofilter -e lists the __db.00X files in order.

The __db.* files are recreated when recovery is run.

> In this test, the log.NNNNNNNNNN files add another 98M for the test
> wordlist (29M). This seems like a steep disk space penalty.  I suspect
> it may be a problem for sites with per-account wordlists.  

The log files can be archived or backed up and then removed; the
db_checkpoint and db_archive utilities help to find out which files are
unused.

A common procedure to shrink the logs without data loss (the compressed
logs need to be gunzipped manually before a catastrophic recovery):

- cd ~/.bogofilter
- db_checkpoint -1
- db_archive | xargs gzip -v
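
The three steps above could be wrapped up roughly as follows. This is
only a sketch: the function name compress_unused_logs is made up, and it
uses a read loop instead of xargs so that gzip is never started without
a file name when db_archive prints nothing.

```sh
# Hypothetical wrapper around the three steps above.
# Argument: the bogofilter directory (default: ~/.bogofilter).
compress_unused_logs() {
    dir=${1:-"$HOME/.bogofilter"}
    (
        # the subshell keeps the caller's working directory unchanged
        cd "$dir" || exit 1
        # force one checkpoint so that older log files fall out of use
        db_checkpoint -1 || exit 1
        # db_archive prints the unused log files, one name per line
        db_archive | while IFS= read -r f; do
            gzip -v "$f"
        done
    )
}
```

Calling "compress_unused_logs ~/.bogofilter" from a nightly cron job
would be one way to use it.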

More details are in the db_archive manual page. Unfortunately, it
sometimes takes two or three reads until you've sorted out which phrases
belong to which option. My logs compress to about 29 % of their original
size.

db_archive lists the log files that are no longer in use (db_checkpoint
puts older log files out of use).

An alternative would be to only dump the database regularly and discard
unused log.* files (recent BerkeleyDB versions can do that
automatically). This prevents catastrophic recovery, but if the dump is
still there, it can be loaded, though that requires manual intervention.
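
A rough sketch of this dump-and-discard alternative, under some
assumptions: the wordlist is called wordlist.db, the dump path is given
as an absolute path, and the function name dump_and_prune is made up.
Unused logs are deleted with plain rm, fed by db_archive, so nothing
here depends on newer BerkeleyDB features.

```sh
# Hypothetical helper: dump the wordlist, then delete unused logs.
# $1: bogofilter directory, $2: absolute path of the dump file.
dump_and_prune() {
    dir=${1:-"$HOME/.bogofilter"}
    dump=${2:-"$HOME/wordlist.dump"}
    (
        cd "$dir" || exit 1
        # dump to a temporary name first, so that a failed dump
        # never replaces the previous good one
        bogoutil -d wordlist.db > "$dump.tmp" || exit 1
        mv "$dump.tmp" "$dump" || exit 1
        # only after the dump succeeded: checkpoint and drop old logs
        db_checkpoint -1 || exit 1
        db_archive | while IFS= read -r f; do
            rm -f "$f"
        done
    )
}
```

After this, catastrophic recovery from the logs is no longer possible;
the dump file is the only fallback, so it should be part of the regular
backup.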

Note, however, that it is not strictly necessary to keep spam and ham
training sets around with a durable database.

> Should there be an option to disable the transactional code for
> space-challenged sites that are willing to live dangerously?

I'll leave this to the users. Consider this a poll :)

> I've been reading the man page for db_archive.  As I understand it,
> recovering from a catastrophic failure requires archiving the database
> file (wordlist.db) and then archiving all the log.NNNNNNNNNN files. 
> Again, this seems like a lot of file saving.

Yup. After db_checkpoint, bogofilter should get along with a single log
file; the other log files are only ever needed again for catastrophic
recovery. I'd think restoring from a bogoutil -d output file by means of
bogoutil -l would be easier than the catastrophic recovery, and also
faster. It seems to me that the bogoutil approach is the pragmatic one,
and I'd probably take it. Note, however, that the backup software can't
just dump the .db file to tape, because some information can still be in
the Mpool file, so bogoutil -d would be the right thing to do.
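
A restore along those lines might look like the sketch below. The
assumptions: the dump was written by bogoutil -d and lives outside the
bogofilter directory, the wordlist is called wordlist.db, and bogoutil
-l sets up a fresh environment in an empty directory. The function name
restore_from_dump is made up.

```sh
# Hypothetical helper: rebuild wordlist.db from a bogoutil -d dump.
# $1: dump file (outside the bogofilter directory),
# $2: bogofilter directory (default: ~/.bogofilter).
restore_from_dump() {
    dump=$1
    dir=${2:-"$HOME/.bogofilter"}
    [ -s "$dump" ] || { echo "no usable dump: $dump" >&2; return 1; }
    # move the damaged environment aside rather than deleting it
    old="$dir.broken.$$"
    mv "$dir" "$old" || return 1
    mkdir "$dir" || return 1
    # carry over the tuning file, if there was one
    if [ -f "$old/DB_CONFIG" ]; then
        cp "$old/DB_CONFIG" "$dir/" || return 1
    fi
    # bogoutil -l reads the dump from standard input
    bogoutil -l "$dir/wordlist.db" < "$dump"
}
```

For example: "restore_from_dump ~/wordlist.dump ~/.bogofilter". The
damaged directory is kept under a .broken.<pid> name until you are sure
the reloaded wordlist is fine.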

> During the 18 months (or so) that bogofilter has been using BerkeleyDB,
> people have developed techniques like saving the N daily snapshots of
> wordlist.db so that, in case of a catastrophic problem, a recent copy of
> wordlist.db is available to put in place.  

These should continue to work.

> Anyhow, the transactional code seems to require a significant amount of
> disk space to support it.  It may be necessary to document this and it
> may be valuable to suggest alternative methods for maintaining wordlist
> integrity.

I'd rather word this "restoring" wordlist integrity. I'll think about
additions for the README.db file on Sunday or next week.

-- 
Matthias Andree

Encrypted mail welcome: my GnuPG key ID is 0x052E7D95 (PGP/MIME preferred)


