A weird wordlist.db problem

David Relson relson at osagesoftware.com
Sat Jun 11 03:28:35 CEST 2005


On Sat, 11 Jun 2005 02:13:25 +1200
Tom Eastman wrote:

> David Relson wrote:
> 
> > You've got a classic case of corrupted database, with one of the
> > internal links being bad.  Running db_verify will confirm that for you.
> > 
> > One solution is to rebuild from scratch.  Take all your saved ham and
> > spam and build a new wordlist.  That's the simplest thing, though
> > probably not what you want to hear.
> 
> Cool thanks for the advice, and I'll look into trying it out in the morning. 
> In the meantime though, I'd just like to say:
> 
> @#$@ #$!$ @#$! #@#@!!!!!!!!!
> 
> AAAAAAaaaaaaaaaaaarrrrrrrgggghhhhh!!
> 
> A pox on all things Berkeley DB related!  What good does it do other than
> cause poor innocent people like me to lose databases?  The mantra I think I
> get is something like "Berkeley DB is a great database that can withstand
> all kinds of things without getting corrupted... as long as you don't do
> any of the hundreds of things that will immediately and irreparably destroy
> your database..."
> 
> Sorry for the rant; this comes a few months after I switched all my subversion
> repositories over to fsfs because a similar problem ate all six of them. 
> What is Berkeley DB buying bogofilter?  It doesn't seem to be peace of
> mind.
> 
> Okay... I'm calm again.  Is there really no way to repair the database? 
> Surely if only one branch is broken, the rest should be repairable?  On
> *either* side of the corrupted area?  You say that 'bogoutil -d' dumps
> things in order... maybe I can hack the source code to make it dump in
> *reverse* order?

db_dump will output the database, one entry per line.  The "gotcha" is
that it's all in hex.  A converter could be written pretty easily to
turn that output into "bogoutil -d" format lines.  I have no information
on how the bad link affects db_dump.
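
For the record, here's a rough sketch of such a converter.  It is not
part of bogofilter, and the value layout (three little-endian 32-bit
integers: spam count, ham count, date) is an assumption that should be
checked against the datastore code before trusting the numbers.

#!/usr/bin/env python3
# Hypothetical sketch: turn `db_dump wordlist.db` output into lines that
# resemble `bogoutil -d` output (token, spam count, ham count, date).
#
# Assumptions to verify against your bogofilter build:
#   * db_dump was run without -p, so keys and values appear hex-encoded,
#     one per line, each indented by a space, between the "HEADER=END"
#     and "DATA=END" markers.
#   * each value holds three little-endian 32-bit integers:
#     spam count, ham (good) count, and date (YYYYMMDD or 0).
import struct
import sys

def pairs(lines):
    """Yield (token_bytes, value_bytes) pairs from db_dump output."""
    in_data = False
    key = None
    for line in lines:
        line = line.rstrip("\n")
        if not in_data:
            if line == "HEADER=END":
                in_data = True
            continue
        if line == "DATA=END":
            break
        raw = bytes.fromhex(line.strip())
        if key is None:
            key = raw            # first line of a pair is the key (token)
        else:
            yield key, raw       # second line is the value (counts)
            key = None

def main():
    for token, value in pairs(sys.stdin):
        if len(value) < 12:
            continue             # record doesn't match the assumed layout
        spam, ham, date = struct.unpack("<III", value[:12])
        print("%-24s %6d %6d %10d" %
              (token.decode("iso-8859-1"), spam, ham, date))

if __name__ == "__main__":
    main()

Running something like "db_dump wordlist.db | python3 convert.py" (the
script name is made up, of course) would then print one token per line
with its counts.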

An idea I had on the way to work is based on the fact that dumping the
wordlist proceeds in alphabetical order.  Adding a check for out-of-order
tokens _might_ let bogoutil extract everything that's still intact and
then exit cleanly (rather than loop and write the same entries again and
again).
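
The same check could be tried outside bogoutil first.  A hypothetical
filter over the (token, value) pairs from the sketch above might be as
simple as:

# Hypothetical sketch of the "stop on out-of-order tokens" idea: walk a
# dump that ought to be sorted, and bail out the first time a token
# sorts at or before its predecessor, instead of looping through the
# damaged region.
def take_while_sorted(pairs):
    """Yield (token, value) pairs until the tokens stop increasing."""
    prev = None
    for token, value in pairs:
        if prev is not None and token <= prev:
            break                # corruption suspected: tokens went backwards
        yield token, value
        prev = token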

If you'd care to gzip your wordlist.db and email it to me (or upload it
to ftp://ftp.osagesoftware.com/pub/incoming), I could run some
experiments.

> My knowledge of the database is pretty minimal... but my database itself is
> only about 5 megabytes... I'm more than happy to make it available if
> someone can get me a dump of it that works.
> 
> Thanks for your help.. I think I should sleep now so I can have a slightly
> cooler look at it in the morning :-) In the meantime, it's still not urgent
> since, quite frankly, it's still classifying mails perfectly :-)

Not too surprising, actually.  The bad link has the effect of keeping
the first part of the alphabet and losing the last part.  Where the
first/last boundary actually falls may not matter much, since the
proportion of spam to ham tokens is likely similar for words starting
with "a", with "b", or with "x".  Stated differently, deleting the end
of the wordlist likely deletes comparable quantities of hammish and
spammish tokens.  Interesting, eh?



