catastrophic recovery required

Matthias Andree matthias.andree at gmx.de
Thu Feb 17 00:39:56 CET 2005


Ben Finney <ben at benfinney.id.au> writes:

> Thanks for the ongoing assistance.

Welcome. Besides, this appears to be a fundamental issue (barring faulty
memory, or an overclocked CPU or memory, that is) that hasn't been
discussed in breadth here yet.

>> 3. ext3fs with data=journal mount option SHOULD be able to write memory
>>    page size (4k) atomically AFAIR.
>
> I'm using ext3, with no particular options fiddling.  I thought the
> point of a journalled filesystem was that *all* data writes are
> atomic?

Unfortunately, it isn't. All journalling file systems I know (ext3fs,
jfs, xfs, reiserfs, logging ufs on Solaris) journal metadata only, that
is, directory structure and the like. ext3fs is the only file system
that goes beyond that, with its data=ordered (default) and data=journal
modes (it also has a data=writeback mode that makes it metadata-only).
data=ordered means user data is written before the references to it
(file names, for instance), which helps with appending data;
data=journal is explained below.
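For what it's worth, the data mode is selected per mount; a sketch of an
/etc/fstab entry (the device name and mount point are made-up examples,
substitute your own):

```shell
# example /etc/fstab line; /dev/hda2 and /var are placeholders
/dev/hda2  /var  ext3  defaults,data=journal  0  2
```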

The journal allows lost actions to be re-done, but journalling file
systems REQUIRE that the drive perform writes in the order requested,
and they also REQUIRE that the drive not cheat WRT the completion of
write operations.

Write caches deceive the file system and may introduce unnoticed
inconsistencies.

> What do I have if I'm not using the "data=journal" option?  (URLs with
> answers to these questions are welcome.)

data=ordered. That means file appends are safe as long as the write
cache is off, but there are no atomicity guarantees beyond aligned
access to a single disk block. For safe "in-the-middle" updates,
data=journal would be required. OTOH, with the write cache turned off,
Berkeley DB should be able to recover automatically.
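As an aside, applications that cannot rely on atomic in-place updates
often sidestep the problem with the classic write-then-rename pattern:
rename() within one file system is atomic on POSIX systems, so readers
see either the old file or the new one, never a mixture. A minimal
sketch (the file names are made up for illustration):

```shell
# Update demo.txt without ever exposing a half-written file:
# write a complete new copy, flush it, then rename it into place.
printf 'old contents\n' > demo.txt

printf 'new contents\n' > demo.txt.tmp
sync                       # flush pending writes toward the drive
mv demo.txt.tmp demo.txt   # rename() over the old name is atomic

cat demo.txt
```

Note this only protects against a torn file at the name level; with the
drive's write cache on, the renamed-to data can still be lost.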

> My understanding was that, since 'bogoutil --db-prune' gave no errors,
> my data was safe.  Backups were kept at one time, but removed once it
> was clear to me that 'bogoutil --db-prune' was an atomic operation.

A crash with reordered writes in progress can cause errors to go
unnoticed until much later.

>> Did you disable the write caches as laid out in README.db? If you did
>> not, that may have been the cause for the troubles you're witnessing.
>> I suggest disabling the write caches if you have flakey power supply.
>
> I don't have a flakey power supply, but like all power systems (even
> ones in well-provisioned grids) it is vulnerable to lightning storms.

Oops, I must have used a wrong term then; perhaps "grid" or "electricity
network" describes better what I meant. I did not mean your PSU, the
unit that turns 110 or 230 V into -12, -5, 3.3, 5 and 12 V.

>> At any rate, check the README.db revision I posted on 2005-02-05 for
>> recovery procedures.
>
> Shall do.  Thanks again.

See appendix A for help on the write caches.

> I believe you that this is supposed to be rare; ideally, impossible.
> How can we make it closer to that ideal?

- make sure the drive doesn't reorder writes (by switching off the write
  cache until write barrier support in the Linux kernel is proven, in
  mainline, and deployed on your system)

- try to configure the database page size to what the file system can
  write atomically (I posted a patch in response to Tom's post); 512
  bytes is the value to use for regular hard disk drives, ATA as well as
  SCSI.

- try to configure the file system to support atomic writes as large as
  the database page. data=journal on ext3fs is good for a 4 kB database
  page size AFAIR
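For the first bullet, on ATA/IDE drives the write cache is toggled with
hdparm; the device name below is an example, and many drives re-enable
the cache after a power cycle, so the command belongs in a boot script:

```shell
# Turn off the on-drive write cache (example device name):
hdparm -W0 /dev/hda

# Query the current write-caching setting:
hdparm -W /dev/hda
```

SCSI drives have a corresponding write-cache mode page; consult your
SCSI tools' documentation for how to clear it.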

> Should bogofilter not continue to operate even in the presence of write
> caching, especially if filesystem-level recovery is successful?

Impossible. Modern drives, with their 2 - 8 MB caches, can lose that
much data after having told the computer "yes, I've written that data".
Drives will also rearrange block write order to speed writes up, for
good throughput. That looks good in benchmarks but is dangerous.

-- 
Matthias Andree



More information about the Bogofilter mailing list