[long] Recovery handling in TXN branch

Sat Aug 21 22:58:51 CEST 2004

On Fri, 20 Aug 2004, Pavel Kankovsky wrote:

> 1. Whenever a process starts, it tries to acquire an exclusive lock on
> LCKF. It does not wait for it.
> 
> 1a. If it gets an exclusive lock, it sets one cell of APRT to 1, locks it,
> and clears the rest. It checks CLNF and resets it (and makes this change

Oops. The program holding an exclusive lock can crash before it 
reinitializes APRT. Another process might fail to get an exclusive lock
but get a shared lock (after the 1st process crashes) and think it is ok 
to use the db even when the recovery is needed.

What a stupid mistake! How embarrassing!

Here is a revised and simplified algorithm (without CLNF--its maintenance
was too complex and error-prone):

1. Check APRT for zombie cells. Go to 3 if no zombie cells are found.
2. Acquire an exclusive lock on LCKF. Go back to 1 if you fail.
   Re-check APRT and go to 3 if no zombie cells are found.
   Otherwise, run the recovery, clear APRT, demote the exclusive lock on 
   LCKF to a shared lock and go to 4.
3. Acquire a shared lock on LCKF. Go back to 1 if you fail.
4. Find a free APRT cell, lock it, check that it is still free
   (abort if it is not). Put 1 into the cell, and commit the change
   to non-volatile memory.
5. Do the work. Monitor APRT in the meantime, abort when a zombie
   cell is found.
6. Clear your APRT cell, release locks on APRT and LCKF.
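
For illustration only, here is a rough Python sketch of steps 1-6. Nothing
in it comes from the TXN branch; it rests on assumptions: LCKF and
APRT are two plain files opened read-write, APRT has one byte per cell
(0 = free, 1 = in use), locking uses POSIX fcntl record locks, and a
"zombie" is a cell that holds 1 while nobody holds its lock. The names
NCELLS, try_lock, unlock, zombies, startup, shutdown and retry_delay are
all made up for the example, as is the short sleep before retrying.

```python
import fcntl
import os
import time

NCELLS = 8  # assumed APRT size: one byte per cell, 0 = free, 1 = in use


def try_lock(fd, kind, start=0, length=0):
    """Non-blocking fcntl record lock; False if another process holds it."""
    try:
        fcntl.lockf(fd, kind | fcntl.LOCK_NB, length, start, os.SEEK_SET)
        return True
    except OSError:
        return False


def unlock(fd, start=0, length=0):
    """Release our lock on the given range (length 0 = to end of file)."""
    fcntl.lockf(fd, fcntl.LOCK_UN, length, start, os.SEEK_SET)


def zombies(aprt, own=None):
    """Cells holding 1 whose lock is free: the owner died before step 6."""
    found = []
    for i, v in enumerate(os.pread(aprt, NCELLS, 0)):
        if v != 1 or i == own or not try_lock(aprt, fcntl.LOCK_EX, i, 1):
            continue
        if os.pread(aprt, 1, i) == b"\x01":  # re-check under the lock
            found.append(i)
        unlock(aprt, i, 1)
    return found


def startup(lckf, aprt, recover, retry_delay=0.01):
    """Steps 1-4; returns the index of the APRT cell we now own."""
    while True:
        if zombies(aprt):                              # step 1
            if not try_lock(lckf, fcntl.LOCK_EX):      # step 2
                time.sleep(retry_delay)
                continue
            if zombies(aprt):                          # re-check under lock
                recover()
                os.pwrite(aprt, bytes(NCELLS), 0)      # clear APRT
                os.fsync(aprt)
            try_lock(lckf, fcntl.LOCK_SH)              # demote to shared
        elif not try_lock(lckf, fcntl.LOCK_SH):        # step 3
            time.sleep(retry_delay)
            continue
        break
    for i, v in enumerate(os.pread(aprt, NCELLS, 0)):  # step 4
        if v == 0 and try_lock(aprt, fcntl.LOCK_EX, i, 1):
            if os.pread(aprt, 1, i) != b"\x00":
                break                                  # no longer free: abort
            os.pwrite(aprt, b"\x01", i)
            os.fsync(aprt)                             # commit to disk
            return i
    unlock(aprt)                                       # drop any cell lock
    unlock(lckf)
    raise RuntimeError("no usable APRT cell")


def shutdown(lckf, aprt, cell):
    """Step 6: clear our cell first, then release both locks."""
    os.pwrite(aprt, b"\x00", cell)
    os.fsync(aprt)
    unlock(aprt, cell, 1)
    unlock(lckf)
```

Step 5 would then amount to calling zombies(aprt, own=cell) every now and
then while working, and aborting if it returns anything. Note that fcntl
locks belong to the process, so this sketch cannot tell apart two users
inside one process, and closing any descriptor for a file drops all of
that process's locks on it.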

Ok, I admit it. This algorithm is nothing but a refinement of Matthias' 
one.

--Pavel Kankovsky aka Peak  [ Boycott Microsoft--http://www.vcnet.com/bms ]
"Resistance is futile. Open your source code and prepare for assimilation."