DB corruption within minutes

Matthias Andree matthias.andree at gmx.de
Sat Jan 11 13:59:05 CET 2003


On Sat, 11 Jan 2003, Gyepi SAM wrote:

> > OK, I have 16 hours of serialized bogofilter now, with flock() in my
> > .mailfilter surrounding it, which means I have at most one bogofilter
> > running at a time. It has gone without corruption, with nearly 900 mails
> > passing through it, which formerly would have killed the db in less than
> > two hours.
> 
> Sounds pretty compelling.
> 
> > I have a strong suspect: our db locking. The corruption always hits my
> > goodlist.db, never my spamlist.db (but then again, there's still more
> > ham than spam here), with bogofilter running in -u (update) mode.
> 
> Surely, it must hit the spamlist.db sometime. Also, we lock the
> goodlist.db first because that's first in the list.
>  
> > I have added t.lock2 and made minor fixes to bogofilter.
> 
> I noticed. I also changed the grind loop to run for 1000 iterations,
> and found no problems.

Even 8 iterations did the job (i.e. corrupted the db) on my system: SuSE
Linux 8.1, Duron/700, UW-SCSI hard drive, DB 4.0.14, kernel 2.4.19, ext3fs.

What's the difference from your system? Slower machine? Slower hard drive?
Faster machine? Different OS? Different DB version?

> > I'm wondering whether we can keep doing things the way we currently do.
> > We release the lock before closing the database, which does not look
> > right. BDB has internal caches (like stdio) that need to be flushed by
> > ->close before we can let go of the lock, so the ordering would likely
> > have to be db_close, db_lock_release; but db_close kills the handle, so
> > all of this should be more tightly integrated.
> 
> The only solutions I can think of are:
> 1. call open(2) on the database file ourselves, so we have a handle to lock, or
> 2. use an external lockfile.

An external lockfile is the global lock we don't want to use for
scalability reasons. I've also thought about integrating the locking with
db_open.
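
For concreteness, a minimal sketch of what such an integration could look
like against the DB 4.0 C API (db_open_locked is an invented name, this is
not the actual bogofilter code): open the database, ask BDB for the
descriptor it uses via DB->fd(), and take an fcntl() lock on it before
returning, so that opening and locking cannot be separated.

/*
 * Sketch only, not the actual bogofilter code; db_open_locked is an
 * invented name.  Open a Berkeley DB 4.0 file and immediately take an
 * fcntl() lock on the very descriptor BDB uses (obtained via DB->fd()),
 * so that "open" and "lock" are one inseparable step.
 */
#include <db.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

static DB *db_open_locked(const char *path, int for_update)
{
    DB *dbp;
    struct flock fl;
    int fd, ret;

    if ((ret = db_create(&dbp, NULL, 0)) != 0) {
        fprintf(stderr, "db_create: %s\n", db_strerror(ret));
        return NULL;
    }

    /* DB 4.0.x signature: no transaction argument */
    ret = dbp->open(dbp, path, NULL, DB_BTREE,
                    for_update ? DB_CREATE : DB_RDONLY, 0664);
    if (ret != 0) {
        dbp->err(dbp, ret, "%s", path);
        dbp->close(dbp, 0);
        return NULL;
    }

    if ((ret = dbp->fd(dbp, &fd)) != 0) {
        dbp->err(dbp, ret, "DB->fd");
        dbp->close(dbp, 0);
        return NULL;
    }

    /* whole-file lock: readers share, updaters exclude everyone */
    memset(&fl, 0, sizeof fl);
    fl.l_type   = for_update ? F_WRLCK : F_RDLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start  = 0;
    fl.l_len    = 0;                   /* 0 = to end of file */
    if (fcntl(fd, F_SETLKW, &fl) < 0) {
        perror("fcntl(F_SETLKW)");
        dbp->close(dbp, 0);
        return NULL;
    }

    return dbp;  /* DB->close() later drops both the handle and the lock */
}

The catch, of course, is that the fcntl() lock goes away as soon as the
descriptor is closed, which is exactly the ordering problem below.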

> > Plus, I believe we cannot release the lock, have someone else update the
> > db and then grab the lock again to proceed. The pages may have changed,
> > so we have inconsistent cache/disk data.
> 
> If we call db_sync() after updating the database but before releasing
> the lock, that should fix any synchronization problems of that sort.

That's only one half. The other half would be to make the other database
handles that have been waiting for the lock flush /their/ caches, but I
don't currently see how that could be done other than with DB->close and
DB->open.
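
To make that ordering concrete, here is a rough sketch (LOCKFILE and
update_under_lock() are invented names; the lock sits on a separate
descriptor purely so that the unlock can come after DB->close, not as an
endorsement of a global lock file): take the lock, open the database,
update, DB->sync()/DB->close() to flush the cache, and only then release
the lock. Since the handle is also opened after the lock is acquired, a
waiting process never starts from stale cached pages.

/*
 * Sketch only; LOCKFILE and update_under_lock() are invented names.
 * The lock lives on its own descriptor purely so the ordering
 *   lock -> DB->open -> update -> DB->sync/DB->close -> unlock
 * is expressible; the handle is opened after the lock is taken and
 * closed (cache flushed) before it is released, so no process ever
 * works from pages another writer changed behind its back.
 */
#include <db.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/file.h>
#include <unistd.h>

#define LOCKFILE "wordlist.lock"       /* assumed name, next to the .db */

static int update_under_lock(const char *dbpath)
{
    DB *dbp;
    int lockfd, ret;

    lockfd = open(LOCKFILE, O_RDWR | O_CREAT, 0664);
    if (lockfd < 0 || flock(lockfd, LOCK_EX) < 0) {
        perror("lock");
        if (lockfd >= 0) close(lockfd);
        return -1;
    }

    /* open only after the lock is held, so the cache starts from disk */
    if ((ret = db_create(&dbp, NULL, 0)) != 0) {
        fprintf(stderr, "db_create: %s\n", db_strerror(ret));
        goto out_unlock;
    }
    if ((ret = dbp->open(dbp, dbpath, NULL, DB_BTREE, DB_CREATE, 0664)) != 0) {
        dbp->err(dbp, ret, "%s", dbpath);
        dbp->close(dbp, 0);
        goto out_unlock;
    }

    /* ... the DB->get()/DB->put() token updates would go here ... */

    dbp->sync(dbp, 0);                 /* push dirty pages to disk ...     */
    dbp->close(dbp, 0);                /* ... and discard the handle/cache */

    flock(lockfd, LOCK_UN);            /* only now may the next writer go  */
    close(lockfd);
    return 0;

out_unlock:
    flock(lockfd, LOCK_UN);
    close(lockfd);
    return -1;
}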

It's, um, sort of unhelpful that you cannot reproduce the problem, because
you seem to have the most BDB experience of all the active bogofilter
hackers. I seem to have grasped the basics, though.

-- 
Matthias Andree
