database rebuild [was: Unregistering Mail]

Nick Simicich njs at scifi.squawk.com
Sat Feb 8 09:41:18 CET 2003


At 08:20 PM 2003-02-07 -0500, David Relson wrote:

>At 05:14 AM 2/7/03, Nick Simicich wrote:
>>At 03:47 PM 2003-02-06 -0500, David Relson wrote:
>>>Do people want the ability to unregister mail?  If so what would be the 
>>>preferred way to do it?  One of the above suggestions or something different?
>>
>>I do not think that unregistering is that important, although there is a 
>>circumstance that I would have used it in.  However, at this point, 
>>(since I am using -u) I have many thousands of messages in the corpus. I 
>>see that you have begun dating the entries so that old entries can be 
>>expired.  I am "reorganizing my database" - at this point, I decided to 
>>do it by renaming the old corpus and doing a
>>
>>bogoutil -d goodlist.old.db | bogoutil -l goodlist.db
>>
>>I tried just dumping it, but I ran out of space at about 100 meg. The db 
>>file fits in under 15 meg.  The load is still running, and it has cranked 
>>for over 95 minutes of CPU at this point.  I believe it is running, 
>>because if it was blocking, in a loop, the dump would not be running, and 
>>it has consumed about 23 minutes of CPU time itself - CPU is split 75% 
>>load - 20% dump. (that does not go to 100%, the CPU is idle about .3%, 
>>the rest is other stuff).
>>
>>This seems like a lot, I will wait for it for a while longer.  I am not 
>>running any e-mail on the system.  I wonder if this is just the way it 
>>is, or if there is any other way?
>
>Nick,
>
>Sounds like your running time for the dump/load is excessive.  "Bogoutil 
>-d" is quick and its output file should be smaller than the database 
>because it doesn't have as much overhead.   "Bogoutil -l" is also quick 
>(with a sorted input file, like that produced by dump).  It sounds like 
>you have a database problem.  Have you run db_verify, or other utility, to 
>check the database integrity?

Um, no.

[njs at scifi .bogofilter]$ /var/spool/news/bogofilter/db-4.1.24/build_unix/db_verify goodlist.db
db_verify: Page 190: out-of-order key at entry 79
db_verify: Page 245: out-of-order key at entry 103
db_verify: Page 1104: out-of-order key at entry 138
db_verify: Page 1794: out-of-order key at entry 28
db_verify: Page 1794: out-of-order key at entry 78
db_verify: Page 2123: out-of-order key at entry 16
db_verify: Page 2701: out-of-order key at entry 4
db_verify: Page 2701: out-of-order key at entry 118
db_verify: Page 2843: out-of-order key at entry 8
db_verify: DB->verify: goodlist.db: DB_VERIFY_BAD: Database verification failed
[njs at scifi .bogofilter]$ /var/spool/news/bogofilter/db-4.1.24/build_unix/db_verify spamlist.db
db_verify: Page 198: out-of-order key at entry 112
db_verify: Page 199: out-of-order key at entry 174
db_verify: DB->verify: spamlist.db: DB_VERIFY_BAD: Database verification failed
[njs at scifi .bogofilter]$

This does not look good.

With my spamlist.db, I was able to:

bogoutil -d spamlist.db | sort -u | bogoutil -l spamlist.new.db

and then the new version verified.
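
A minimal sketch of that rebuild as a reusable shell function, for anyone who wants to repeat it. It assumes bogoutil and db_verify are both on the PATH (here db_verify lives under the db-4.1.24 build directory), and the function name rebuild_wordlist is made up for illustration. Note that sort -u only removes lines that are identical in full.

```shell
#!/bin/sh
# Rebuild a wordlist by dumping it, de-duplicating the dump, and loading
# the result into a fresh file. The original database is left untouched.
# Assumes bogoutil and db_verify can be found on the PATH.
rebuild_wordlist() {
    old=$1
    new=$2
    # sort -u drops exact duplicate lines and hands bogoutil -l sorted input.
    bogoutil -d "$old" | sort -u | bogoutil -l "$new" || return 1
    # Report success only if the rebuilt file passes verification.
    db_verify "$new"
}
```

Usage would be something like "rebuild_wordlist spamlist.db spamlist.new.db", followed by a manual rename once the exit status is zero.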

I tried

/var/spool/news/bogofilter/db-4.1.24/build_unix/db_dump -p -f goodlist.out goodlist.db

After it had accumulated about 30 meg of output, I ran this command:

[root at scifi bogofilter-0.9.1.2]# grep process-accounting-4.html ~njs/.bogofilter/goodlist.out | wc -l
     151
[root at scifi bogofilter-0.9.1.2]#

Displaying the matches shows that the entire key is repeated.  There is a 
loop: both the key and the data are repeated.
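
The grep count above checks a single key. A rough sketch of a more general check follows; it assumes the usual db_dump -p text layout (header lines up to HEADER=END, then alternating key and data lines, each with a leading space, terminated by DATA=END). The function name repeated_keys is hypothetical.

```shell
#!/bin/sh
# Print "count key" for every key that occurs more than once in a
# db_dump -p style text dump, most frequent first.
# Assumes key and data lines alternate between HEADER=END and DATA=END.
repeated_keys() {
    awk '
        /^HEADER=END$/ { in_data = 1; n = 0; next }
        /^DATA=END$/   { in_data = 0; next }
        in_data {
            n++
            if (n % 2 == 1) {        # odd lines are keys, even lines are data
                k = $0
                sub(/^ /, "", k)     # drop the leading space db_dump adds
                seen[k]++
            }
        }
        END {
            for (k in seen)
                if (seen[k] > 1) print seen[k], k
        }
    ' "$1" | sort -rn
}
```

On a healthy btree dump this should print nothing, since every key appears once; any output suggests the dump is walking the same pages repeatedly.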

/var/spool/news/bogofilter/db-4.1.24/build_unix/db_dump -r -p -f goodlist.out goodlist.db

Gives me about 10 meg of output. That particular key appears only once.

I ran this:

[njs at scifi .bogofilter]$ /var/spool/news/bogofilter/db-4.1.24/build_unix/db_load -t btree -f goodlist.out goodlist.new.db

and then finally this:

  bogoutil -d goodlist.new.db | bogoutil -l goodlist.new2.db
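
The three recovery steps above could be strung together roughly as follows. This is only a sketch: it assumes db_dump, db_load and bogoutil are on the PATH and that mktemp -d is available, and the name salvage_wordlist is invented for the example. The damaged file is only read, never modified.

```shell
#!/bin/sh
# Salvage a damaged wordlist: recovery dump, reload as a btree, then
# pass the result through bogoutil to produce the final database.
# Assumes db_dump, db_load and bogoutil are on the PATH.
salvage_wordlist() {
    damaged=$1
    out=$2
    tmpdir=$(mktemp -d) || return 1
    # -r asks db_dump to salvage what it can from a possibly corrupt file.
    db_dump -r -p -f "$tmpdir/salvage.out" "$damaged" &&
    db_load -t btree -f "$tmpdir/salvage.out" "$tmpdir/salvage.db" &&
    bogoutil -d "$tmpdir/salvage.db" | bogoutil -l "$out"
    status=$?
    rm -rf "$tmpdir"
    return $status
}
```

Running db_verify on the output file before putting it into service would still be a sensible last step.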

File sizes, just for grins.  The database produced by bogoutil is about 
2/3rds the size of the one made by db_load.

[njs at scifi .bogofilter]$ du -h -a
2.6K    ./goodlist
1.3K    ./badlist
380K    ./spamlist.db
173K    ./spamlist.serial
173K    ./spamlist.sorted
14M     ./goodlist.db
1.4M    ./spamlist.bad.db
14M     ./goodlist.backup.db
9.7M    ./goodlist.out
16M     ./goodlist.new.db
10M     ./goodlist.new2.db
66M     .
[njs at scifi .bogofilter]$ ls -l
total 67866
-rw-rw-rw-   1 njs      wheel        1305 Dec  4 03:21 badlist
-rw-rw-rw-   1 njs      wheel        2671 Dec  4 03:21 goodlist
-rw-r--r--   1 njs      wheel    14696448 Feb  8 02:22 goodlist.backup.db
-rw-r--r--   1 njs      wheel    14696448 Feb  8 02:33 goodlist.db
-rw-r--r--   1 njs      wheel    16650240 Feb  8 02:55 goodlist.new.db
-rw-r--r--   1 njs      wheel    10776576 Feb  8 03:05 goodlist.new2.db
-rw-r--r--   1 njs      wheel    10215565 Feb  8 02:49 goodlist.out
-rw-r--r--   1 njs      wheel     1425408 Feb  8 02:13 spamlist.bad.db
-rw-r--r--   1 njs      wheel      389120 Feb  8 02:16 spamlist.db
-rw-r--r--   1 njs      wheel      177325 Feb  8 02:11 spamlist.serial
-rw-r--r--   1 njs      wheel      177325 Feb  8 02:12 spamlist.sorted
[njs at scifi .bogofilter]$

So, yes, something, perhaps filling the filesystem, caused all sorts of 
problems. Standard dumps, whether the db_dump that came with db or your 
bogoutil -d, were worthless because of the looping, and I honestly do not 
know whether there is any way to deal with that properly.  The recovery 
dump (db_dump -r) worked, and I was able to rebuild something that looked 
good and get going again.

Just for grins, I tried a -R recovery.  That produced about half again as 
much data output.  I decided not to use that since I figured that it was, 
as the man page suggested, mostly garbage.

Thanks for the db_verify suggestion, that got me pointed down the "road to 
recovery".

By the way, as far as comment syntax goes:  Piotr KUCHARSKI ran the page 
through Opera 6 and sent me the output: It gets all cases correct except 
for 8, the nested comment case, strengthening the case against closing 
comments on naked >.  I noticed another text-based browser included in 
Linux when Red Hat released a fix for it: w3m, which is supposed to "do 
the right thing" when displaying HTML and plain text. It 
works properly with cases 1-7, and does what Opera does with case 8: It 
closes on the first -->, not properly dealing with nested comments.

--

SPAM: Trademark for spiced, chopped ham manufactured by Hormel.
spam: Unsolicited, Bulk E-mail, where e-mail can be interpreted generally 
to mean electronic messages designed to be read by an individual, and it 
can include Usenet, SMS, AIM, etc.  But if it is not all three of 
Unsolicited, Bulk, and E-mail, it simply is not spam. Misusing the term 
plays into the hands of the spammers, since it causes confusion, and 
spammers thrive on  confusion. Spam is not speech, it is an action, like 
theft, or vandalism. If you were not confused, would you patronize a spammer?
Nick Simicich - njs at scifi.squawk.com - http://scifi.squawk.com/njs.html
Stop by and light up the world!


