Transactional Code and Disk Usage

David Relson relson at osagesoftware.com
Sat Sep 4 01:06:14 CEST 2004


Matthias,

bogofilter-0.92.6+txn2.1 is a distinct improvement over ...txn2.0.  In
my BerkeleyDB (4.1.25) environment, "make check" now passes all 37
regression tests and my private test is also successful.  Bravo!!

For those interested, here's the test I ran and the results ...

###### script 0903.sh ######

#!/bin/sh
DIR=`date +%m%d.d`
mkdir $DIR
bogofilter -v -C -d $DIR -n -I ~/Archive/2004/2004.08.h
bogofilter -v -C -d $DIR -s -I ~/Archive/2004/2004.08.s
ls -lh $DIR

###### output ######

# 4045684 words, 8076 messages
# 5007248 words, 19261 messages
total 133M
-rw-r--r--  1 relson relson 8.0K Sep  3 18:36 __db.001
-rw-r--r--  1 relson relson 5.1M Sep  3 18:36 __db.002
-rw-r--r--  1 relson relson  96K Sep  3 18:36 __db.003
-rw-r--r--  1 relson relson 368K Sep  3 18:36 __db.004
-rw-r--r--  1 relson relson  16K Sep  3 18:36 __db.005
-rw-r--r--  1 relson relson    0 Sep  3 18:28 lockfile-do-not-delete
-rw-r--r--  1 relson relson  10M Sep  3 18:29 log.0000000001
-rw-r--r--  1 relson relson  10M Sep  3 18:29 log.0000000002
-rw-r--r--  1 relson relson  10M Sep  3 18:29 log.0000000003
-rw-r--r--  1 relson relson  10M Sep  3 18:29 log.0000000004
-rw-r--r--  1 relson relson  10M Sep  3 18:29 log.0000000005
-rw-r--r--  1 relson relson  10M Sep  3 18:30 log.0000000006
-rw-r--r--  1 relson relson  10M Sep  3 18:30 log.0000000007
-rw-r--r--  1 relson relson  10M Sep  3 18:30 log.0000000008
-rw-r--r--  1 relson relson  10M Sep  3 18:30 log.0000000009
-rw-r--r--  1 relson relson 8.2M Sep  3 18:36 log.0000000010
-rw-r--r--  1 relson relson  29M Sep  3 18:36 wordlist.db

Now for some questions...  

'Tis my understanding that the __db.00x files are permanent additions
needed to support the transactional code.  It appears that they need
about 6M per wordlist (5.6M here, nearly all of it in __db.002),
correct?

In this test, the log.NNNNNNNNNN files add another 98M on top of the
29M test wordlist, about 3.4 times the size of the wordlist itself.
This seems like a steep disk space penalty, and I suspect it may be a
problem for sites with per-account wordlists.
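
To put a number on that overhead for any given wordlist directory, a
minimal sketch along these lines could be used.  The function name
log_overhead is made up for illustration; it only assumes that the log
files match log.* and sit next to wordlist.db, as in the listing above.

```shell
# Hypothetical helper: report the bytes used by the log.* files and by
# wordlist.db in one wordlist directory.
log_overhead() {
    dir=$1
    logs=0
    for f in "$dir"/log.*; do
        # Skip the unexpanded pattern when no log files exist.
        [ -f "$f" ] || continue
        size=$(wc -c < "$f")
        logs=$((logs + size))
    done
    # Arithmetic expansion strips any padding that wc adds.
    db=$(( $(wc -c < "$dir/wordlist.db") ))
    echo "logs=$logs wordlist=$db"
}
```

Calling it as "log_overhead 0903.d" would print both totals in bytes,
which makes it easy to compare directories for per-account wordlists.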

Should there be an option to disable the transactional code for
space-challenged sites that are willing to live dangerously?

I've been reading the man page for db_archive.  As I understand it,
recovering from a catastrophic failure requires archiving the database
file (wordlist.db) and then archiving all the log.NNNNNNNNNN files. 
Again, this seems like a lot of file saving.
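
As a rough sketch of that procedure (database file first, then the
logs), something like the following might work.  It relies on the
db_archive options -h (environment directory), -s (list database files)
and -l (list all log files).  The function name archive_env and the
DB_ARCHIVE override variable are assumptions made for this example, and
it assumes db_archive prints names relative to the environment
directory.  It is not a documented bogofilter procedure.

```shell
# Hypothetical helper: copy the database files, then the log files, of
# a BerkeleyDB environment into a backup directory.
# DB_ARCHIVE (an assumption) lets the utility's name or path be changed.
archive_env() {
    home=$1
    dest=$2
    tool=${DB_ARCHIVE:-db_archive}
    mkdir -p "$dest" || return 1
    # -s lists database files, -l lists all log files.
    # Databases are copied before logs, matching the order described above.
    for flag in -s -l; do
        "$tool" -h "$home" "$flag" | while read -r name; do
            cp "$home/$name" "$dest/" || exit 1
        done || return 1
    done
}
```

For the test directory above this would be invoked as
"archive_env 0903.d /some/backup/dir", and it copies every file the
utility lists, which is exactly the "lot of file saving" in question.
Note that plain sh does not notice a failure of db_archive itself inside
the pipeline, so a real script would want to check for that separately.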

During the 18 months (or so) that bogofilter has been using BerkeleyDB,
people have developed techniques like saving N daily snapshots of
wordlist.db so that, in case of a catastrophic problem, a recent copy of
wordlist.db is available to put in place.

FWIW, my wordlist.db currently occupies 54MB.  Using "bogoutil -d
wordlist.db > wordlist.txt" produces a 38MB file.  Compressing with gzip
reduces the file to 11MB, while bzip2 gives a 9.4MB file.  It seems that
3 bzip2-compressed backups (hourly, daily, whatever...) will need about
28MB, roughly 50% of the original space.
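
A snapshot scheme of that kind could be sketched as below.  The function
name rotate_snapshots and the .1, .2, ... suffix convention are
assumptions for illustration; only "bogoutil -d" and the compressors
come from the figures above.

```shell
# Hypothetical helper: keep the newest $2 snapshots of file $1 as
# $1.1 (newest) through $1.$2 (oldest).
rotate_snapshots() {
    file=$1
    keep=$2
    # Drop the oldest snapshot, then shift the remaining ones up by one.
    rm -f "$file.$keep"
    i=$keep
    while [ "$i" -gt 1 ]; do
        prev=$((i - 1))
        if [ -f "$file.$prev" ]; then
            mv "$file.$prev" "$file.$i"
        fi
        i=$prev
    done
    # The fresh dump becomes snapshot 1.
    if [ -f "$file" ]; then
        mv "$file" "$file.1"
    fi
}

# Possible use from cron (paths are placeholders):
#   bogoutil -d wordlist.db | bzip2 > wordlist.txt.bz2
#   rotate_snapshots wordlist.txt.bz2 3
```

Keeping text dumps rather than copies of wordlist.db is what gives the
space saving, at the cost of having to reload the dump after a failure.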

Anyhow, the transactional code seems to require a significant amount of
disk space to support it.  It may be necessary to document this and it
may be valuable to suggest alternative methods for maintaining wordlist
integrity.

David


