[long] Recovery handling in TXN branch

Fri Aug 20 11:50:39 CEST 2004

On Fri, 20 Aug 2004, Pavel Kankovsky wrote:

> On Sat, 14 Aug 2004, Matthias Andree wrote:
> 
> > The problems we have are:
> > 
> > #1 detect an unclean exit
> >     a - in a live system QUICKLY (ok, a minute should suffice)
> 
> Would it be acceptable to set a reasonably long alarm (e.g. 30 seconds)
> before any db operation (or group of them) and conclude the db is
> deadlocked and needs recovery when the alarm expires?

I'd wondered about that. How do we figure out an adequate timeout? How do we
know what timeouts external software, for instance maildrop or Postfix,
impose?

> Here is my proposal:
> 
> There are three variables: a lock file (LCKF), active process table
> (APRT), and a db-is-clean flag (CLNF).
> 
> LCKF can 1. be unlocked (no process accessing the db is running), 2. have
> a single exclusive lock (a single process having an exclusive access to db
> is running), or 3. have one or more shared locks.
> 
> APRT is an array of cells containing 0 or 1 where 0 is a free cell and 1
> represents one process (dead or alive) working on the db. Processes can
> get locks on individual cells of APRT. Locked 1 represents a live process,
> unlocked 1--"a zombie cell"--represents a crashed process.
> 
> CLNF is set when no processes are accessing the db, and all processes
> working with it since the last recovery have exited cleanly.
> 
> Locks on LCKF and APRT and APRT values are transient. I assume locks are
> removed automatically when processes holding them exit (or die).

They are.

> APRT may
> be in an undefined state after reboot. Moreover, I assume that one of
> multiple colliding attempts to acquire an exclusive lock on previously
> unlocked LCKF will always be successful, even if the attempts are
> nonblocking (this is a reasonable assumption but we all know OS
> designers might use a different definition of "reasonable"...).

As far as I know from documentation and experience, yes. First come,
first served, no surprises here.

> Advantages:
> - no dedicated watcher process (100% distributed logic)
> - no signals (can work among processes running under different
>   uids, no pid recycling races)

We'd still need internal signals such as SIGALRM to run the watcher
every 30 seconds or so.

> - extra synchronous disk i/o is minimized (in particular,
>   there are no operations modifying directories if APRT is
>   implemented as a single file).
> 
> Disadvantages:
> - a little bit too complex (hmm...perhaps it could be made simpler
>   if CLNF was removed and APRT was made "semipersistent"--i.e.
>   0->1 would be written synchronously; unfortunately, this would
>   increase the number of synchronous writes in a busy environment)
> 
> > #2 in case of an unclean exit, abort all other bogofilter processes
> >    accessing the same data base
> >    (DB_ENV->set_flags(env, ...DB_PANIC_ENVIRONMENT) should work)
> 
> Any process concluding the db is deadlocked (btw: is it possible to set
> DB_PANIC_ENVIRONMENT from a signal handler?)

No DB calls are permitted from signal handlers; BerkeleyDB is not
re-entrant. We can only crash when we find that another process has
exited uncleanly.

Also, setting DB_PANIC_ENVIRONMENT does not abort existing processes.

> > If it finds one, a process has exited uncleanly (see below); the
> > watcher then creates the need-recovery file, unlinks the dbuser.$PID file,
> > sets the "panic" flag in the environment, and then kills all processes
> > that have a lock on the db* files and removes their dbuser.$PID files.
> 
> The watcher might see a locked pid file but the process might exit
> (unlocking the file) and its pid might be recycled before the watcher
> kills it. It is rather unlikely but it can happen.

I don't see how that would be a problem, if we scan for "need recovery"
before locking our own dbuser.$PID file.

> > Then, the bogofilter checks if the needs-recovery file exists and if it
> > does, tries a blocking exclusive lock on that file. If it is granted
> > that lock, it checks if the file still exists and if it does, performs
> > recovery. If the file doesn't exist, someone else has recovered the
> > file, so release the lock and proceed.
> 
> I think there is a race condition here. needs-recovery might be created
> after this check (and before dbuser.$PID is created) and there might be
> one process performing the recovery while another process attempts to use
> the database in the usual way.

Maybe. I wonder how your single-file APRT is handled. Will it be
grown on demand, i.e., if I don't find a cell, reopen with O_APPEND and
write one?

I don't see much difference in complexity between your solution and mine,
and I like the idea of letting the process check for others from a
signal handler, and that we can resort to fdatasync(), which will be
cheaper than a directory sync.
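
As a sketch of the grow-on-demand question, assuming one byte per cell and a
descriptor that the caller has already opened with O_WRONLY | O_APPEND,
appending a cell and flushing it with fdatasync() might look like this. The
name aprt_append_cell() is hypothetical, and whether the table should grow
this way at all is exactly the open question.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Append one cell holding the value 1 to the table and flush it to disk.
 * append_fd must have been opened with O_APPEND. Returns the index of the
 * new cell, or -1 on error.
 */
off_t aprt_append_cell(int append_fd)
{
    char one = 1;
    off_t end;

    /* With O_APPEND the write lands at the current end of file, even if
     * other processes append at the same time. */
    if (write(append_fd, &one, 1) != 1)
        return -1;

    /* The file offset now sits just past the byte we wrote. */
    end = lseek(append_fd, 0, SEEK_CUR);
    if (end < 1)
        return -1;

    /* Flush the data, and the metadata needed to read it back (such as the
     * new file size), without touching the directory. */
    if (fdatasync(append_fd) != 0)
        return -1;

    return end - 1;
}
```

The new cell is not locked yet; the caller would still have to take the lock
on that byte. One caveat with the "reopen" variant: with POSIX fcntl() locks,
closing any descriptor for a file releases all locks the process holds on that
file, so the extra descriptor should stay open for as long as a cell lock is
held.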

I'll think about this and then see which route I'll take.

Thank you!

-- 
Matthias Andree

NOTE YOU WILL NOT RECEIVE MY MAIL IF YOU'RE USING SPF!
Encrypted mail welcome: my GnuPG key ID is 0x052E7D95 (PGP/MIME preferred)