[long] Recovery handling in TXN branch

Pavel Kankovsky peak at argo.troja.mff.cuni.cz
Fri Aug 20 11:16:50 CEST 2004


On Sat, 14 Aug 2004, Matthias Andree wrote:

> The problems we have are:
> 
> #1 detect an unclean exit
>     a - in a life system QUICKLY (ok, a minute should suffice)

Would it be acceptable to set a reasonably long alarm (e.g. 30 seconds)
before any db operation (or group of them) and conclude the db is
deadlocked and needs recovery when the alarm expires?
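
A minimal sketch of the idea, assuming POSIX signals; db_operation() is
just a placeholder for the real Berkeley DB call, and a real process
would of course trigger the recovery handling rather than merely exit:

  #include <signal.h>
  #include <stdlib.h>
  #include <string.h>
  #include <unistd.h>

  static void on_alarm(int sig)
  {
      (void)sig;
      /* The db call did not finish in time: conclude the db is
       * deadlocked and die so recovery can be run. _exit() is
       * async-signal-safe. */
      _exit(EXIT_FAILURE);
  }

  static int db_operation(void)      /* placeholder for the real work */
  {
      return 0;
  }

  int main(void)
  {
      struct sigaction sa;
      memset(&sa, 0, sizeof sa);
      sa.sa_handler = on_alarm;
      sigaction(SIGALRM, &sa, NULL);

      alarm(30);                     /* the proposed timeout */
      db_operation();
      alarm(0);                      /* finished in time: disarm */
      return 0;
  }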

> [#1 continued]
>     b - after a system crash

> #3 have a lock mechanism to make sure only one process can run recovery
>    at a time

Here is my proposal:

There are three variables: a lock file (LCKF), active process table
(APRT), and a db-is-clean flag (CLNF).

LCKF can be in one of three states:
1. unlocked (no process accessing the db is running),
2. locked with a single exclusive lock (a single process with exclusive
   access to the db is running), or
3. locked with one or more shared locks.

APRT is an array of cells containing 0 or 1 where 0 is a free cell and 1
represents one process (dead or alive) working on the db. Processes can
get locks on individual cells of APRT. A locked 1 represents a live
process; an unlocked 1--"a zombie cell"--represents a crashed process.

CLNF is set when no processes are accessing the db, and all processes
working with it since the last recovery have exited cleanly.

Locks on LCKF and APRT, and the APRT values themselves, are transient. I
assume locks are removed automatically when the processes holding them
exit (or die). APRT may be in an undefined state after a reboot. Moreover,
I assume that one of multiple colliding attempts to acquire an exclusive
lock on a previously unlocked LCKF will always succeed, even if the
attempts are nonblocking (this is a reasonable assumption, but we all know
OS designers might use a different definition of "reasonable"...).

The value of CLNF should be persistent (or, to be more precise, the
resetting of CLNF should be persistent; the other change might be lost
after a reboot...at the expense of an unnecessary recovery).
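
To make this concrete, here is one possible way to back the three
variables with plain files and POSIX fcntl() record locks (the kernel
releases those automatically when the holder exits, which gives exactly
the transience assumed above). The helper names, file layout, and the
APRT_CELLS size are all made up for illustration; the usual fcntl()
caveat applies that closing any descriptor of a file drops all of the
process's locks on it:

  #include <fcntl.h>
  #include <string.h>
  #include <unistd.h>

  #define APRT_CELLS 64              /* assumed fixed table size */

  /* Set (or clear) a non-blocking fcntl record lock on one byte. */
  static int lock_byte(int fd, short type, off_t offset)
  {
      struct flock fl;
      memset(&fl, 0, sizeof fl);
      fl.l_type = type;              /* F_RDLCK, F_WRLCK or F_UNLCK */
      fl.l_whence = SEEK_SET;
      fl.l_start = offset;
      fl.l_len = 1;
      return fcntl(fd, F_SETLK, &fl);   /* fails if contended */
  }

  /* LCKF: shared vs. exclusive lock on byte 0 of the lock file.
   * Converting the lock type in place gives demotion and promotion;
   * a failed non-blocking promotion leaves the old lock untouched. */
  int lckf_try_exclusive(int fd)    { return lock_byte(fd, F_WRLCK, 0); }
  int lckf_try_shared(int fd)       { return lock_byte(fd, F_RDLCK, 0); }
  int lckf_demote_to_shared(int fd) { return lock_byte(fd, F_RDLCK, 0); }
  int lckf_try_promote(int fd)      { return lock_byte(fd, F_WRLCK, 0); }
  int lckf_unlock(int fd)           { return lock_byte(fd, F_UNLCK, 0); }

  /* APRT: cell i is byte i of the table file; its value is the byte
   * itself (0 free, 1 in use) and "locked" means an exclusive fcntl
   * lock on that byte. The kernel dropping the lock on exit is what
   * turns a crashed process's 1 into a zombie cell. */
  int aprt_lock_cell(int fd, int i)   { return lock_byte(fd, F_WRLCK, i); }
  int aprt_unlock_cell(int fd, int i) { return lock_byte(fd, F_UNLCK, i); }

  int aprt_read_cell(int fd, int i, unsigned char *v)
  {
      return pread(fd, v, 1, i) == 1 ? 0 : -1;
  }

  int aprt_write_cell(int fd, int i, unsigned char v)
  {
      return pwrite(fd, &v, 1, i) == 1 ? 0 : -1;
  }

  /* Probe whether some OTHER process holds a lock on cell i
   * (F_GETLK never reports the caller's own locks). */
  int aprt_cell_is_locked(int fd, int i)
  {
      struct flock fl;
      memset(&fl, 0, sizeof fl);
      fl.l_type = F_WRLCK;
      fl.l_whence = SEEK_SET;
      fl.l_start = i;
      fl.l_len = 1;
      if (fcntl(fd, F_GETLK, &fl) == -1)
          return -1;
      return fl.l_type != F_UNLCK;
  }

  /* CLNF: a single persistent byte. Clearing it is fsync'ed so the
   * "needs recovery" state survives a crash or reboot; setting it is
   * allowed to get lost, at the cost of an unnecessary recovery. */
  int clnf_read(int fd)              /* 1 = clean, 0 = dirty */
  {
      unsigned char v = 0;
      return pread(fd, &v, 1, 0) == 1 && v == 1;
  }

  int clnf_clear(int fd)
  {
      unsigned char zero = 0;
      if (pwrite(fd, &zero, 1, 0) != 1)
          return -1;
      return fsync(fd);
  }

  int clnf_set(int fd)
  {
      unsigned char one = 1;
      return pwrite(fd, &one, 1, 0) == 1 ? 0 : -1;
  }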

And now, here is the algorithm:

1. Whenever a process starts, it tries to acquire an exclusive lock on
LCKF. It does not wait for it.

1a. If it gets an exclusive lock, it sets one cell of APRT to 1, locks it,
and clears the rest. It checks CLNF and resets it (and makes this change
persistent). If CLNF was not set, it starts the recovery and waits until
the recovery completes successfully. It demotes its exclusive lock to a
shared lock and goes to step 2.
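
In terms of the helpers sketched above, step 1a might look roughly like
this (run_recovery() is a placeholder; the process simply claims cell 0,
since it holds the table exclusively):

  /* Prototypes from the layout sketch above, plus a placeholder. */
  int lckf_try_exclusive(int fd);
  int lckf_demote_to_shared(int fd);
  int aprt_lock_cell(int fd, int i);
  int aprt_write_cell(int fd, int i, unsigned char v);
  int clnf_read(int fd);
  int clnf_clear(int fd);
  int run_recovery(void);            /* placeholder for db recovery */

  #define APRT_CELLS 64

  /* Returns 0 (the cell index we now own) on success, -1 if the
   * caller has to retry step 1 or give up. */
  int startup_exclusive(int lckf_fd, int aprt_fd, int clnf_fd)
  {
      int i, was_clean;

      if (lckf_try_exclusive(lckf_fd) == -1)
          return -1;                 /* someone else got there first */

      /* We are alone: claim cell 0, clear the rest of the table. */
      if (aprt_lock_cell(aprt_fd, 0) == -1)
          return -1;
      aprt_write_cell(aprt_fd, 0, 1);
      for (i = 1; i < APRT_CELLS; i++)
          aprt_write_cell(aprt_fd, i, 0);

      /* Remember whether the last session ended cleanly, then mark
       * the db dirty (persistently) before touching it. */
      was_clean = clnf_read(clnf_fd);
      clnf_clear(clnf_fd);

      if (!was_clean && run_recovery() != 0)
          return -1;                 /* recovery must succeed */

      if (lckf_demote_to_shared(lckf_fd) == -1)
          return -1;
      return 0;                      /* proceed to step 2 */
  }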

1b. If it cannot get an exclusive lock, it scans APRT for zombie cells.
If it finds such a cell, it goes back to step 1 immediately (recovery is
needed). Otherwise it acquires a shared lock on LCKF. If it cannot get a
shared lock it goes back to step 1. It checks CLNF, and scans APRT again.
If CLNF is set or if it finds a zombie cell during the 2nd pass, it
releases its shared lock and goes back to step 1. It picks a zero cell in
APRT, locks the cell (it will try again on another zero cell if it cannot
acquire the lock because some other process has already locked it), checks
that the zero is still there (if it is not, it will go back to step 1),
and replaces 0 with 1.
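
Step 1b, again built on the same helpers; -1 means "go back to step 1"
and the caller loops:

  /* Prototypes from the layout sketch above. */
  int lckf_try_shared(int fd);
  int lckf_unlock(int fd);
  int aprt_lock_cell(int fd, int i);
  int aprt_unlock_cell(int fd, int i);
  int aprt_read_cell(int fd, int i, unsigned char *v);
  int aprt_write_cell(int fd, int i, unsigned char v);
  int aprt_cell_is_locked(int fd, int i);
  int clnf_read(int fd);

  #define APRT_CELLS 64

  /* A zombie cell carries value 1 but no live lock. */
  int aprt_has_zombie(int aprt_fd)
  {
      int i;
      unsigned char v;

      for (i = 0; i < APRT_CELLS; i++) {
          if (aprt_read_cell(aprt_fd, i, &v) == 0 &&
              v == 1 && !aprt_cell_is_locked(aprt_fd, i))
              return 1;
      }
      return 0;
  }

  /* Returns the claimed cell index, or -1 meaning "retry step 1". */
  int startup_shared(int lckf_fd, int aprt_fd, int clnf_fd)
  {
      int i;
      unsigned char v;

      if (aprt_has_zombie(aprt_fd))
          return -1;                 /* recovery is needed */

      if (lckf_try_shared(lckf_fd) == -1)
          return -1;

      /* Second pass: the world may have changed while we locked. */
      if (clnf_read(clnf_fd) || aprt_has_zombie(aprt_fd)) {
          lckf_unlock(lckf_fd);
          return -1;
      }

      /* Pick a zero cell, lock it, and register ourselves in it. */
      for (i = 0; i < APRT_CELLS; i++) {
          if (aprt_read_cell(aprt_fd, i, &v) != 0 || v != 0)
              continue;
          if (aprt_lock_cell(aprt_fd, i) == -1)
              continue;              /* taken meanwhile: try another */
          if (aprt_read_cell(aprt_fd, i, &v) != 0 || v != 0) {
              aprt_unlock_cell(aprt_fd, i);
              lckf_unlock(lckf_fd);
              return -1;             /* the zero is gone: retry step 1 */
          }
          aprt_write_cell(aprt_fd, i, 1);
          return i;                  /* registered: proceed to step 2 */
      }

      lckf_unlock(lckf_fd);
      return -1;                     /* table is full: retry step 1 */
  }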

2. The process is holding a shared lock on LCKF here, and we can assume
the db is in a usable state. CLNF must be unset here. The process does
whatever it wants to do. It monitors APRT (e.g. periodically in a
signal handler) and aborts whenever it finds a zombie cell in APRT.
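
The monitoring could be as simple as a SIGALRM-driven rescan; pread()
and fcntl() are on the POSIX list of async-signal-safe functions, so
aprt_has_zombie() from the previous sketch should be callable from the
handler (the 5-second period is arbitrary):

  #include <signal.h>
  #include <stdlib.h>
  #include <string.h>
  #include <unistd.h>

  int aprt_has_zombie(int aprt_fd);  /* from the step 1b sketch */

  static int monitored_aprt_fd = -1;

  static void on_tick(int sig)
  {
      (void)sig;
      if (aprt_has_zombie(monitored_aprt_fd))
          _exit(EXIT_FAILURE);       /* another process died: abort */
      alarm(5);                      /* re-arm the periodic check */
  }

  void monitor_aprt(int aprt_fd)
  {
      struct sigaction sa;

      monitored_aprt_fd = aprt_fd;
      memset(&sa, 0, sizeof sa);
      sa.sa_handler = on_tick;
      sigaction(SIGALRM, &sa, NULL);
      alarm(5);                      /* first check in 5 seconds */
  }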

3. When the process finishes successfully, it clears its cell in APRT but
keeps it locked. It scans APRT and exits whenever it finds a nonzero cell.
If all cells are zero, it tests whether any of the cells preceding its own
cell are locked and exits when it finds a locked cell. Otherwise it tries
to promote its shared lock to an exclusive lock. If that fails, it goes
back to the APRT scan. If it gets the exclusive lock, it rechecks APRT and
exits when a nonzero cell is found. Otherwise it sets CLNF and exits.
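
And a sketch of step 3, assuming the same helpers (clnf_set() is the
non-fsync'ed counterpart of clnf_clear(), and lckf_try_promote() is just
a non-blocking re-lock of the LCKF byte with F_WRLCK); the one-second
pause between rescans is my addition to avoid busy-waiting:

  #include <unistd.h>

  /* Prototypes from the layout sketch above. */
  int lckf_try_promote(int fd);
  int aprt_read_cell(int fd, int i, unsigned char *v);
  int aprt_write_cell(int fd, int i, unsigned char v);
  int aprt_cell_is_locked(int fd, int i);
  int clnf_set(int fd);

  #define APRT_CELLS 64

  void finish(int lckf_fd, int aprt_fd, int clnf_fd, int my_cell)
  {
      int i;
      unsigned char v;

      /* Clear our cell but keep its lock: our 0 can never look like
       * a zombie to anybody else. */
      aprt_write_cell(aprt_fd, my_cell, 0);

      for (;;) {
          /* Anyone still registered? Then the last one out will set
           * CLNF, not us. */
          for (i = 0; i < APRT_CELLS; i++)
              if (aprt_read_cell(aprt_fd, i, &v) == 0 && v != 0)
                  return;

          /* A locked (zero) cell before ours belongs to a process
           * that is also in step 3; defer to it and just exit. */
          for (i = 0; i < my_cell; i++)
              if (aprt_cell_is_locked(aprt_fd, i))
                  return;

          if (lckf_try_promote(lckf_fd) == -1) {
              sleep(1);              /* not in the description above */
              continue;              /* others still hold LCKF: rescan */
          }

          /* Exclusive now: nobody can register behind our back. */
          for (i = 0; i < APRT_CELLS; i++)
              if (aprt_read_cell(aprt_fd, i, &v) == 0 && v != 0)
                  return;
          clnf_set(clnf_fd);
          return;
      }
  }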

Advantages:
- no dedicated watcher process (100% distributed logic)
- no signals (can work among processes running under different
  uids, no pid recycling races)
- extra synchronous disk i/o is minimized (in particular,
  there are no operations modifying directories if APRT is
  implemented as a single file).

Disadvantages:
- a little bit too complex (hmm...perhaps it could be made simpler
  if CLNF were removed and APRT were made "semipersistent"--i.e.
  0->1 would be written synchronously; unfortunately, this would
  increase the number of synchronous writes in a busy environment)

> #2 in case of an unclean exit, abort all other bogofilter processes
>    accessing the same data base
>    (DB_ENV->set_flags(env, ...DB_PANIC_ENVIRONMENT) should work)

Any process concluding the db is deadlocked (btw: is it possible to set
DB_PANIC_ENVIRONMENT from a signal handler?) would abort (or restart),
leaving a zombie cell in APRT. The other processes would abort (restart)
as soon as they spot it.
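
For reference, setting the flag from ordinary (non-signal) context looks
like this in the BDB 4.x C API; whether it may be called from a signal
handler is exactly the open question, and I am not assuming it here:

  #include <db.h>
  #include <stdlib.h>

  void request_panic(DB_ENV *dbenv)
  {
      /* After this, other handles on the same environment should start
       * failing with DB_RUNRECOVERY and can abort or restart. */
      if (dbenv->set_flags(dbenv, DB_PANIC_ENVIRONMENT, 1) != 0)
          abort();
  }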

> If it finds one, a process has exited uncleanly (see below), the
> watcher creates the need-recovery file, unlinks the dbuser.$PID file and
> sets the "panic" flag in the environment, and then kills all processes
> that have a lock on the db* files and removes their dbuser.$PID files.

The watcher might see a locked pid file but the process might exit
(unlocking the file) and its pid might be recycled before the watcher
kills it. It is rather unlikely but it can happen.

> Then, the bogofilter checks if the needs-recovery file exists and if it
> does, tries a blocking exclusive lock on that file. If it is granted
> that lock, it checks if the file still exists and if it does, performs
> recovery. If the file doesn't exist, someone else has recovered the
> file, so release the lock and proceed.

I think there is a race condition here: needs-recovery might be created
after this check (and before dbuser.$PID is created), and there might be
one process performing the recovery while another process attempts to use
the database in the usual way.


On Mon, 16 Aug 2004, Matthias Andree wrote:

> I was under the impression that ext2, ext3, jfs, xfs, ufs and reiserfs
> track "needs fsck" (dirty) or "needs recovery" themselves.

Yes. They have a "dirty" (or "clean") flag in the superblock. The flag is
set (or reset) whenever the filesystem is (re)mounted r/w, and reset (or
set) when it is unmounted or remounted r/o. This is similar to my CLNF
(see above).


On Mon, 16 Aug 2004, Gyepi SAM wrote:

> Perhaps it's time to investigate an alternative database system. I recommend
> SQLite (http://www.sqlite.com/). It has all the desired features with, AFAIK,
> none of the drawbacks of BDB.

SQLite is a nice engine (and it appears to be pretty fast...at least
compared to Oracle <g>) but I am afraid concurrent access (esp.
modifications) is not one of its strengths.


--Pavel Kankovsky aka Peak  [ Boycott Microsoft--http://www.vcnet.com/bms ]
"Resistance is futile. Open your source code and prepare for assimilation."






