Ideas wanted for TXN and Concurrent store recovery handling

Mon Jul 26 14:47:58 CEST 2004

Hi,

I'm looking for ideas for handling recovery in the
transactional/concurrent branches of bogofilter.

These branches use Berkeley DB's native locking mechanism for
efficiency. If a bogofilter process crashes or is forcibly killed, its
database locks may not be cleared, and all subsequent attempts to run
bogofilter will then wait for a lock release that never happens. The
remedy is to stop all bogofilter/bogoutil processes, prevent new ones
from being started by stopping the mail system, run
db_recover -h .bogofilter, and then restart the mail system.

My goal is to have bogofilter detect such stuck locks itself and set a
marker so that it can run recovery instead of staying stuck for a long
time. The traditional code used fcntl-style locks, which are released
automatically when the application holding them exits, whether in an
orderly way, through a crash, or by being killed.
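
To illustrate the fcntl behaviour described above, here is a minimal
sketch in Python (bogofilter itself is written in C; Python is used here
only for brevity). It assumes a POSIX system. The function name
lock_state_around_kill is hypothetical and not part of bogofilter. A
child process takes an exclusive fcntl lock and is then killed with
SIGKILL; the parent checks whether the lock is held before and after.

```python
import fcntl
import os
import signal


def lock_state_around_kill(path):
    """Return (held_while_alive, held_after_kill) for an fcntl lock on `path`.

    Hypothetical demo helper: a child takes an exclusive lock on an existing
    file and is killed without any chance to clean up.
    """
    ready_r, ready_w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: take the lock, tell the parent, then wait to be killed.
        try:
            os.close(ready_r)
            fd = os.open(path, os.O_RDWR)
            fcntl.lockf(fd, fcntl.LOCK_EX)
            os.write(ready_w, b"x")
            while True:
                signal.pause()
        finally:
            os._exit(1)

    os.close(ready_w)
    os.read(ready_r, 1)          # wait until the child holds the lock
    os.close(ready_r)

    fd = os.open(path, os.O_RDWR)
    try:
        held_while_alive = not _try_lock(fd)
        os.kill(pid, signal.SIGKILL)   # simulate a crash / forced kill
        os.waitpid(pid, 0)
        held_after_kill = not _try_lock(fd)
    finally:
        os.close(fd)
    return held_while_alive, held_after_kill


def _try_lock(fd):
    """Try a non-blocking exclusive lock; release it again on success."""
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        return False
    fcntl.lockf(fd, fcntl.LOCK_UN)
    return True
```

The kernel drops the lock when the owning process dies, which is the
property the Berkeley DB native locks lack.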

I currently see two approaches to attack the problem:

1. We can use a timer. This raises a new problem: how do I tell whether
   the operation is just slow or really stuck? If anyone knows a way to
   tell, please speak up.

2. We can use a more elaborate locking protocol for every bogofilter
   process that is about to open the database, so that only one process
   at a time can run recovery. Recovery would run as part of
   bogofilter's startup sequence whenever an exclusive lock can be
   acquired: a process about to do recovery tries for an exclusive
   lock, while a process that will just read from or modify the
   database takes a shared lock. This avoids running recovery in the
   middle of another process modifying the database.
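
Approach 2 could be sketched roughly as follows. This is only an
illustration under stated assumptions, not a proposed implementation: it
is in Python rather than bogofilter's C, it assumes a POSIX system, and
the names open_with_recovery, lock_path and recover are hypothetical. It
uses an fcntl lock on a separate lock file, so the startup lock itself
goes away when its holder dies.

```python
import errno
import fcntl
import os


def open_with_recovery(lock_path, recover):
    """Take the startup lock, running `recover` only if nobody else is active.

    Returns (fd, recovered). The caller keeps `fd` open for as long as it
    uses the database; closing it (or exiting) releases the shared lock.
    `recover` is a caller-supplied callable, for example one that runs
    db_recover on the database directory.
    """
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Non-blocking attempt: succeeds only if no other process holds
        # a shared or exclusive lock on the lock file.
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError as exc:
        if exc.errno not in (errno.EACCES, errno.EAGAIN):
            os.close(fd)
            raise
        # Someone else is active (or recovering): wait for a shared lock.
        fcntl.lockf(fd, fcntl.LOCK_SH)
        return fd, False

    try:
        recover()                      # we are alone, so recovery is safe
    except BaseException:
        os.close(fd)
        raise
    # Downgrade to a shared lock so other processes can start normally.
    fcntl.lockf(fd, fcntl.LOCK_SH)
    return fd, True
```

A caller would do fd, recovered = open_with_recovery(path, run_recovery)
before opening the database, and os.close(fd) after closing it. Note
that with this scheme recovery only happens at a moment when no other
process holds the lock, so it does not by itself answer the question of
when to give up waiting on a stuck Berkeley DB lock.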

Ideas are welcome. Please direct your replies to the
bogofilter-dev at bogofilter.org mailing list.

-- 
Matthias Andree

Encrypted mail welcome: my GnuPG key ID is 0x052E7D95 (PGP/MIME preferred)