Transactional Code and Disk Usage

David Relson relson at osagesoftware.com
Sat Sep 4 02:19:24 CEST 2004


On Sat, 4 Sep 2004 01:41:19 +0200
Matthias Andree wrote:

> On Fri, 03 Sep 2004, David Relson wrote:

...[snip]...

> > Now for some questions...  
> > 
> > 'Tis my understanding that the __db.00x files are permanent
> > additions needed to support the transactional code.  It appears that
> > they need about 6M per wordlist, correct?
> 
> That depends on the exact size. One file is the environment info, one
> is the memory pool ("cache"), one is the log info (the actual log is
> in the log.* files obviously), one is the lock region and finally
> there is the transactional region. The memory pool is large and a
> tuning factor and the lock region will have to grow (via DB_CONFIG) as
> the data base grows if we want bogoutil -d to work.
> 
> db_stat -h .bogofilter -e lists the __db.00X files in order.
> 
> The __db.* files are recreated when recovery is run.
> 
> > In this test, the log.NNNNNNNNNN files add another 98M for the test
> > wordlist (29M). This seems like a steep disk space penalty.  I
> > suspect it may be a problem for sites with per-account wordlists.  
> 
> The log files can be archived or backed up and then removed; the
> db_checkpoint and db_archive utilities help find out which files
> are unused.

I hadn't noticed the db_checkpoint command.  'Tis an additional man page
to read.  Guess it's time to point my browser at
/usr/share/doc/db4-utils-4.1.25/utility/index.html and see what's there.

Frankly when I chose to learn about identifying spam I didn't realize
I'd have to learn the ins and outs of a database system.  Drat :-<

> Common procedure to shrink the logs without data loss (need to be
> gunzipped manually before a catastrophic recovery):
> 
> - cd ~/.bogofilter
> - db_checkpoint -1
> - db_archive | xargs gzip -v

Perhaps this should become the db.compact script :-)
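Something along those lines might look like this (a sketch, untested;
BOGODIR and the function name are my own invention):

```shell
#!/bin/sh
# db.compact -- checkpoint the environment, then gzip the log files
# that BerkeleyDB no longer needs.  BOGODIR is an assumption; point it
# at your wordlist directory.
BOGODIR="${BOGODIR:-$HOME/.bogofilter}"

compact_logs() {
    cd "$BOGODIR" || return 1
    # Force a checkpoint so older log.* files fall out of use.
    db_checkpoint -1 || return 1
    # db_archive prints the names of logs no longer in use; compress
    # them (gunzip them again before a catastrophic recovery).
    db_archive | xargs -r gzip -v
}
```

Running compact_logs from cron after the nightly training run would
keep the log growth in check.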

> More details are in the db_archive manual page. Unfortunately, it
> sometimes takes two or three reads before you've sorted out which
> phrases belong to which option. My logs compress to about 29 % of
> their original size.
> 
> db_archive lists the log files that are no longer in use
> (db_checkpoint puts older log files out of use).
> 
> An alternative would be to only dump the data base regularly and
> discard unused log.* files (recent BerkeleyDB versions can do that
> automatically), which prevents catastrophic recovery (but if the dump
> is still there, it can be loaded; which requires manual intervention
> though).

Trade-off -- Save space vs manual intervention ...
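For the record, the dump-and-discard route might look roughly like
this (my sketch, untested; DUMPDIR and the file names are my own
choices, and db_archive -d deletes the unused logs instead of listing
them):

```shell
#!/bin/sh
# Sketch of the space-saving alternative: keep a current text dump of
# the wordlist and delete (not archive) the unused log files.
BOGODIR="${BOGODIR:-$HOME/.bogofilter}"
DUMPDIR="${DUMPDIR:-$HOME/.bogofilter-dumps}"

dump_and_trim() {
    mkdir -p "$DUMPDIR" || return 1
    # bogoutil -d reads through the normal database path, so it sees
    # data that is still sitting in the memory pool.
    bogoutil -d "$BOGODIR/wordlist.db" | gzip \
        > "$DUMPDIR/wordlist.dump.gz" || return 1
    # Checkpoint, then remove the logs no longer in use (-d deletes
    # instead of listing).  After this, only the dump can restore us.
    ( cd "$BOGODIR" && db_checkpoint -1 && db_archive -d )
}
```

That is where the manual intervention comes in: after a crash, the
dump has to be reloaded with bogoutil -l by hand.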

> Note however that it is not strictly necessary to keep spam and ham
> training sets around with a durable data base.

Understood !

> > Should there be an option to disable the transactional code for
> > space-challenged sites that are willing to live dangerously?
> 
> I'll leave this to the users. Consider this a poll :)

Absolutely.  Let them vote!

> > I've been reading the man page for db_archive.  As I understand it,
> > recovering from a catastrophic failure requires archiving the
> > database file (wordlist.db) and then archiving all the
> > log.NNNNNNNNNN files. Again, this seems like a lot of file saving.
> 
> Yup. After db_checkpoint, bogofilter should get along with a single
> log file, the other log files are only ever needed again for
> catastrophic recovery. I'd think restoring from a bogoutil -d output
> file by means of bogoutil -l would be easier than the catastrophic
> recovery, and also faster. It seems to me that the bogoutil approach
> for catastrophic recovery is the pragmatic one and I'd probably take
> this. Note however that the backup software can't just dump the .db
> file to tape, because some information may still be in the Mpool
> file, so bogoutil -d would be the right thing to do.

"bogoutil -d | bogoutil -l" has the nice effect of producing a smaller
wordlist -- same set of tokens in fewer disk pages.  That's a plus.  The
downside is that it can take a while for a large db.
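For safety I'd dump to a temporary file and build a fresh database
before replacing the old one -- something like this (untested sketch,
names mine):

```shell
#!/bin/sh
# Compact the wordlist by dumping and reloading it.  The dump goes to
# a temporary file, so wordlist.db stays intact until the new database
# has been built completely.
BOGODIR="${BOGODIR:-$HOME/.bogofilter}"

rebuild_wordlist() {
    tmp=$(mktemp) || return 1
    bogoutil -d "$BOGODIR/wordlist.db" > "$tmp" \
        || { rm -f "$tmp"; return 1; }
    # Load the token/count pairs into a new file, then swap it in.
    bogoutil -l "$BOGODIR/wordlist.db.new" < "$tmp" \
        || { rm -f "$tmp"; return 1; }
    mv "$BOGODIR/wordlist.db.new" "$BOGODIR/wordlist.db" && rm -f "$tmp"
}
```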

> > During the 18 months (or so) that bogofilter has been using
> > BerkeleyDB, people have developed techniques like saving the N daily
> > snapshots of wordlist.db so that, in case of a catastrophic problem,
> > a recent copy of wordlist.db is available to put in place.  
> 
> These should continue to work.

Yes.
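A rotation along those lines, using bogoutil -d so that data still in
the memory pool is included, might look like this (sketch only; KEEP
and the paths are my assumptions):

```shell
#!/bin/sh
# Daily snapshot sketch: keep the KEEP newest text dumps of the
# wordlist, named by date.  All names below are illustrative.
BOGODIR="${BOGODIR:-$HOME/.bogofilter}"
SNAPDIR="${SNAPDIR:-$HOME/.bogofilter-snapshots}"
KEEP=7

snapshot_wordlist() {
    mkdir -p "$SNAPDIR" || return 1
    bogoutil -d "$BOGODIR/wordlist.db" | gzip \
        > "$SNAPDIR/wordlist.$(date +%Y%m%d).gz" || return 1
    # List newest first; drop everything past the first $KEEP entries.
    ls -1t "$SNAPDIR"/wordlist.*.gz 2>/dev/null \
        | tail -n +$((KEEP + 1)) | xargs -r rm -f
}
```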

> > Anyhow, the transactional code seems to require a significant amount
> > of disk space to support it.  It may be necessary to document this
> > and it may be valuable to suggest alternative methods for
> > maintaining wordlist integrity.
> 
> I'd rather word this "restoring" wordlist integrity. I'll think about
> additions for the README.db file on Sunday or next week.

"restoring" is good!

-- 
David Relson                   Osage Software Systems, Inc.
relson at osagesoftware.com       Ann Arbor, MI 48103
www.osagesoftware.com          tel:  734.821.8800



More information about the Bogofilter mailing list