Problem with wordlist.db (Berkeley DB)

David Relson relson at osagesoftware.com
Thu Aug 3 03:46:53 CEST 2006


On Wed, 2 Aug 2006 16:37:31 +0200
Belette wrote:

> Hello
> 
> I used db_verify on my wordlist.db
> 
> [root@chris1 ~]# /usr/local/BerkeleyDB.3.3-shared/bin/db_verify
> /mail/bogofilter//wordlist.db
> db_verify: Last item on page 11105 sorted greater than parent entry
> db_verify: Last item on page 11104 sorted greater than parent entry
> db_verify: Last item on page 11106 sorted greater than parent entry
> db_verify: First item on page 11104 sorted greater than parent entry
> db_verify: Page 11104 linked twice
> db_verify: Last item on page 11851 sorted greater than parent entry
> db_verify: First item on page 11106 sorted greater than parent entry
> db_verify: Page 11106 linked twice
> db_verify: First item on page 11109 sorted greater than parent entry
> db_verify: Page 11109 linked twice
> db_verify: First item on page 11105 sorted greater than parent entry
> db_verify: Page 11105 linked twice
> db_verify: DB->verify: /wanadoo/bogofilter/ukfilter/wordlist.db:
> DB_VERIFY_BAD: Database verification failed
> 
> 
> is there any way to solve this problem ? i tried this command :
> 
>  bogoutil -d wordlist.db | bogoutil -l wordlist.new.db
> 
> but this has no end... it started yesterday morning.. and is not
> finished yet.
> 
> Thx 4 ur help
> 
> Christophe

Hi Christophe,

Unfortunately, Berkeley DB databases do occasionally become corrupted.
Running bogofilter with Berkeley DB's transactional mode helps protect
against such problems.  Another protective measure is to dump the
database periodically, for example with bogoutil, and keep the last
N copies.
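
A minimal sketch of such a periodic dump, assuming it is run from cron
or similar.  The function names, the backup directory layout and the
file naming scheme (wordlist.YYYYMMDD-HHMMSS.txt) are hypothetical, not
part of bogofilter; only "bogoutil -d" comes from bogofilter itself.

```shell
#!/bin/sh
# Hypothetical helper: keep only the newest $2 dump files in directory $1.
# Assumes dumps are named wordlist.YYYYMMDD-HHMMSS.txt, so that lexical
# order matches age and the names contain no whitespace.
prune_dumps() {
    dir=$1
    keep=$2
    # Newest first; everything after the first $keep entries is deleted.
    ls "$dir"/wordlist.*.txt 2>/dev/null | sort -r | tail -n +"$((keep + 1))" |
    while IFS= read -r old; do
        rm -f -- "$old"
    done
}

# Hypothetical wrapper: dump database $1 into directory $2, keep $3 copies.
backup_wordlist() {
    db=$1
    dir=$2
    keep=$3
    mkdir -p "$dir" || return 1
    tmp="$dir/.wordlist.tmp.$$"
    # Dump to a temporary file first so a failed dump never replaces
    # or counts as a good copy.
    if bogoutil -d "$db" > "$tmp"; then
        mv "$tmp" "$dir/wordlist.$(date +%Y%m%d-%H%M%S).txt" &&
        prune_dumps "$dir" "$keep"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

It could be called as, say, "backup_wordlist /mail/bogofilter/wordlist.db
/var/backups/bogofilter 7", where the backup directory is only an
example.  Note that a dump of an already looping database would not
terminate, so this is protection to set up while the database is still
healthy.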

Since your database is already broken, you can use bogoutil to dump
it to a text file.  As you've noticed, your database seems to contain
a loop, so reading it never ends; the dump just keeps repeating the
"looped" part.  Running the following:

  bogoutil -d wordlist.db | tee wordlist.txt

will let you spot the repetition, at which point hitting Ctrl-C will
give you your wordlist (or part of it), with some duplication.  Piping
the file through "sort -u" removes the duplicate lines before
rebuilding, for example:

   sort -u < wordlist.txt | bogoutil -l wordlist.db.new
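
If watching the output for the repetition is impractical, the cut-off
can be automated.  The following is only a sketch: the function name is
made up, and it assumes that each dump line starts with the token
followed by whitespace and that a healthy dump never lists the same
token twice, so the first repeated token marks the start of the loop.

```shell
#!/bin/sh
# Hypothetical filter: copy stdin to stdout until a first-field value
# (assumed to be the token) is seen for the second time, then stop.
cut_at_first_repeat() {
    awk 'seen[$1]++ { exit } { print }'
}
```

It would be used as "bogoutil -d wordlist.db | cut_at_first_repeat >
wordlist.txt".  Once awk exits, the pipe closes, which should end the
otherwise endless dump; the resulting file can then be loaded with
"bogoutil -l wordlist.db.new < wordlist.txt".  awk keeps every token it
has seen in memory, which is normally fine for a wordlist but worth
knowing for very large databases.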

Experience indicates that even with only half a wordlist, for example
words "a..." through "m...", bogofilter can still do a good job of
classifying ham and spam.  Of course, you'll want to do more training
to improve bogofilter's accuracy.

HTH,

David
