Robust databases (was Re: locking)

Tim Freeman tim at fungible.com
Tue Mar 4 05:35:26 CET 2003


From: Matthias Andree <matthias.andree at gmx.de>
>We should see after 0.11 if using some other means (transactional store
>or PostgreSQL) is acceptable performance-wise, installation effort wise
>and if it is more "survivable". Ideally, our users wouldn't have to
>rebuild the data base.

A memory-mapped lossy hash as is used in crm114 ought to be robust
against program crashes or even machine crashes, so long as the bytes
read from the file are bytes that were written to the same position in
the file at some time in the past.  You can also do concurrent reads
and writes without locking.  (Concurrent writes might miscount
occasionally, but who cares?  It's just one data point.)  You can even
randomly alter a few bytes and the program's behavior barely changes.
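A minimal sketch of that kind of table, in Python: a file of fixed-size
32-bit counters, memory-mapped, with tokens hashed straight to a slot.
This is an illustration of the idea, not crm114's actual file format,
and the slot count, hash choice, and function names are all made up here.

```python
import hashlib
import mmap
import os
import struct

SLOTS = 1 << 16          # table size; a real filter would use far more
SLOT_FMT = "<I"          # one 32-bit little-endian count per slot
SLOT_SIZE = struct.calcsize(SLOT_FMT)

def open_table(path):
    """Create or open the fixed-size counter file and memory-map it."""
    if not os.path.exists(path) or os.path.getsize(path) != SLOTS * SLOT_SIZE:
        with open(path, "wb") as f:
            f.write(b"\0" * SLOTS * SLOT_SIZE)
    f = open(path, "r+b")
    return f, mmap.mmap(f.fileno(), SLOTS * SLOT_SIZE)

def slot_of(token):
    """Hash the token to a slot; colliding tokens share a count (lossy)."""
    digest = hashlib.md5(token.encode()).digest()
    return int.from_bytes(digest[:4], "little") % SLOTS

def bump(table, token):
    """Increment the token's count in place.  No lock is taken, so a
    concurrent writer may occasionally lose one increment -- by design."""
    off = slot_of(token) * SLOT_SIZE
    (count,) = struct.unpack_from(SLOT_FMT, table, off)
    struct.pack_into(SLOT_FMT, table, off, (count + 1) & 0xFFFFFFFF)

def count(table, token):
    off = slot_of(token) * SLOT_SIZE
    return struct.unpack_from(SLOT_FMT, table, off)[0]
```

Because every slot is a fixed-width counter at a fixed offset, a crash
mid-write can at worst leave one counter slightly off, which is exactly
the "it's just one data point" property described above.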

If instead you hash your key to a bin of key-value pairs, then when
the bin overflows you could discard the least interesting tokens from
that bin.  You could save room by storing the hash of the key instead
of the key.  In any case you'd have at most one disk access per query.
The table would have enough information to let you rehash it to make
it larger or smaller.
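The bin scheme could look like the following sketch (in-memory here; on
disk each bin would be a fixed-size record, so one seek-and-read answers
a query).  The eviction score, bin sizes, and names are placeholders of
my own, not an existing implementation:

```python
import hashlib

BINS = 4096
PAIRS_PER_BIN = 8        # fixed bin capacity; overflow forces eviction

def key_hash(token):
    # Store only this 64-bit hash, not the token itself, to save room.
    return int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "little")

def bin_of(token):
    return key_hash(token) % BINS

def eviction_score(spam, good):
    # Placeholder "interestingness": total evidence seen for the token.
    return spam + good

def insert(bins, token, spam, good):
    """Add a (hash, counts) pair; evict the least interesting on overflow."""
    b = bins[bin_of(token)]
    b.append((key_hash(token), spam, good))
    if len(b) > PAIRS_PER_BIN:
        b.sort(key=lambda p: eviction_score(p[1], p[2]))
        del b[0]          # discard the least interesting token's pair

def lookup(bins, token):
    h = key_hash(token)
    for kh, spam, good in bins[bin_of(token)]:
        if kh == h:
            return spam, good
    return None
```

Two distinct tokens can in principle share the 64-bit key hash and be
conflated, but as with the lossy hash above, a rare miscount is
acceptable.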

Hmm, if you don't store a date, it's unclear how a new word would have
a chance to become more interesting than the old words we know
something about.  You'd have to come up with a definition of
"interesting" that gives new words a chance.  If we've seen the word
"oprah" in one new spam and we've seen the word "intelligent" in 536
spams and 470 good mails, I suppose the new word "oprah" should
replace the old word "intelligent".
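One possible definition with that property scores a word by how far its
spam probability sits from the uninformative 0.5.  This is just one
candidate scoring rule, not a claim about what bogofilter does:

```python
def spamminess(spam_count, good_count):
    total = spam_count + good_count
    return spam_count / total if total else 0.5

def interest(spam_count, good_count):
    """Distance of the word's spam probability from the neutral 0.5.
    "oprah" (1 spam, 0 good) scores 0.5; "intelligent" (536 spam,
    470 good) scores about 0.033, so the new word wins."""
    return abs(spamminess(spam_count, good_count) - 0.5)
```

Note the trade-off this makes explicit: a score that ignores total
counts trusts a single sighting as much as hundreds, which is the same
tension the date question above is getting at.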

Or you could just do LRU in each bin.  But that requires writing the
database when you read it.  You could do probabilistic-LRU, where with
some low probability you move the recently accessed word to the
beginning of the list.  That should be nearly as good as LRU, but you
usually don't have to write the database when you read it.  Maybe the
probability of being moved to the front of the list should be higher
for interesting words than for uninteresting words?
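Probabilistic-LRU is small enough to sketch directly; the move
probability and names here are illustrative assumptions:

```python
import random

MOVE_PROB = 0.05   # low probability: most reads leave the bin untouched

def access(bin_pairs, index, rng=random.random):
    """Return the pair at `index`; with small probability move it to the
    front of its bin.  The second return value says whether the bin
    changed and therefore needs writing back to disk."""
    pair = bin_pairs[index]
    if rng() < MOVE_PROB:
        del bin_pairs[index]
        bin_pairs.insert(0, pair)
        return pair, True    # bin changed: write it back
    return pair, False       # read-only access, no write needed
```

Raising MOVE_PROB for interesting words, as suggested above, would just
mean passing a per-word probability instead of the constant.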

If you store the hash of the key instead of the key, you'd be giving
up the ability to extract a word list from the database.  Is that good
for anything?
-- 
Tim Freeman       
tim at fungible.com
GPG public key fingerprint ECDF 46F8 3B80 BB9E 575D  7180 76DF FE00 34B1 5C78 




More information about the Bogofilter mailing list