A weird wordlist.db problem

David Relson relson at osagesoftware.com
Fri Jun 10 13:22:31 CEST 2005


On Fri, 10 Jun 2005 15:17:17 +1200
Tom Eastman wrote:

> Here's an interesting one... bogofilter has worked beautifully for me for
> years, but Berkeley always seems to be a nightmare.
> 
> Berkeley is some kind of tree structure, right?  Well, it looks kind of like
> I have a branch pointing back into itself or something like that...
> 
> If I attempt a 'bogoutil -d wordlist.db', the output simply continues
> forever, my wordlist.db is about five megabytes, but I killed the dump once
> the dumpfile had reached half a *gigabyte*.
> 
> Running the dump for a while and then piping the output through
> 'sort | uniq -c -d' showed large (LARGE) numbers of duplicate lines.
> 
> Clearly there is some kind of loop in the data structure causing bogoutil to
> run forever.  I *like* my current wordlist, and things, oddly enough, still
> seem to work, as far as learning and classification is concerned.
> 
> How can I fix this?  How can I recover my database to the point at least
> where I can do a dump/reload and make it healthy again?
> 
> Thanks,
> 
>         Tom

Hi Tom,

You've got a classic case of a corrupted database, with one of the
internal links gone bad.  Running db_verify will confirm that for you.
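
For example (db_verify ships with the Berkeley DB utilities; on some
systems the binary carries a version suffix such as db4.2_verify, and
adjust the path if your wordlist lives somewhere other than the default
~/.bogofilter):

   cd ~/.bogofilter
   db_verify wordlist.db

If the tree is damaged, db_verify prints diagnostics about the bad
pages rather than exiting quietly.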

One solution is to rebuild from scratch.  Take all your saved ham and
spam and build a new wordlist.  That's the simplest thing, though
probably not what you want to hear.
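
A minimal sketch, assuming your saved mail is in mbox files (the paths
here are placeholders; -s registers spam, -n registers ham, and -M
tells bogofilter to treat the input as an mbox):

   cd ~/.bogofilter
   mv wordlist.db wordlist.db.broken
   bogofilter -M -s < ~/mail/saved-spam.mbox
   bogofilter -M -n < ~/mail/saved-ham.mbox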

When "bogoutil -d" is dumping the database, the tokens are in
alphabetical order.  How far through the alphabet does the endless dump
get?  That'll tell you how much of your wordlist you can recover.

To recover, I suggest running "bogoutil -d wordlist.db" and killing it
when the output file reaches 10 MB.  That will capture everything that
Berkeley DB will let you see.  Run "sort | uniq" on the dump to
eliminate the duplicate lines.  Finally, run "bogoutil -l
wordlist.db.new" with the cleaned-up dump on stdin.  A bit of renaming
and you should be good to go.  You'll have lost some of your tokens,
but bogofilter will work (at somewhat reduced accuracy).
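
Roughly like this (file names are examples; kill the dump by hand once
it stops making alphabetical progress or passes ~10 MB):

   cd ~/.bogofilter
   bogoutil -d wordlist.db > dump.txt        # kill this by hand
   sort dump.txt | uniq > dump.clean.txt     # drop the duplicate lines
   bogoutil -l wordlist.db.new < dump.clean.txt
   mv wordlist.db wordlist.db.broken
   mv wordlist.db.new wordlist.db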

Traditionally folks have dumped their database periodically to have a
backup.  A weekly cron job like the following works well:

   DATE=`date +%m%d`
   bogoutil -d wordlist.db > wordlist.txt.$DATE
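
If you put that straight into a crontab, remember that cron treats a
bare % as a newline, so escape it (the schedule and path here are just
examples):

   # weekly, Sundays at 3 a.m.
   0 3 * * 0  cd $HOME/.bogofilter && bogoutil -d wordlist.db > wordlist.txt.`date +\%m\%d`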

The current stable version of bogofilter supports Berkeley DB's
transactional capability, which provides significant safeguards
against the kind of corruption you've encountered.  It uses a
write-ahead log so the database can be recovered if the application
(bogofilter) crashes during a write.  Transactions also allow multiple
programs to read and write the database simultaneously.  Neat stuff.
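
Should the transactional version ever be interrupted mid-write,
Berkeley DB's standard recovery utility can replay the log.  A sketch,
assuming the default ~/.bogofilter directory (-h names the database
home):

   db_recover -h ~/.bogofilter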

I've outlined two options for creating a working wordlist and two for
keeping it working.  I'm sure others will chime in with additional
thoughts.

HTH,

David
