database rebuild [was: Unregistering Mail]
Nick Simicich
njs at scifi.squawk.com
Sat Feb 8 09:41:18 CET 2003
At 08:20 PM 2003-02-07 -0500, David Relson wrote:
>At 05:14 AM 2/7/03, Nick Simicich wrote:
>>At 03:47 PM 2003-02-06 -0500, David Relson wrote:
>>>Do people want the ability to unregister mail? If so what would be the
>>>preferred way to do it? One of the above suggestions or something different?
>>
>>I do not think that unregistering is that important, although there is a
>>circumstance that I would have used it in. However, at this point,
>>(since I am using -u) I have many thousands of messages in the corpus. I
>>see that you have begun dating the entries so that old entries can be
>>expired. I am "reorganizing my database" - at this point, I decided to
>>do it by renaming the old corpus and doing a
>>
>>bogoutil -d goodlist.old.db | bogoutil -l goodlist.db
>>
>>I tried just dumping it, but I ran out of space at about 100 meg. The db
>>file fits in under 15 meg. The load is still running, and it has cranked
>>for over 95 minutes of CPU at this point. I believe it is running,
>>because if it was blocking, in a loop, the dump would not be running, and
>>it has consumed about 23 minutes of CPU time itself - CPU is split 75%
>>load - 20% dump. (that does not go to 100%, the CPU is idle about .3%,
>>the rest is other stuff).
>>
>>This seems like a lot, I will wait for it for a while longer. I am not
>>running any e-mail on the system. I wonder if this is just the way it
>>is, or if there is any other way?
>
>Nick,
>
>Sounds like your running time for the dump/load is excessive. "Bogoutil
>-d" is quick and its output file should be smaller than the database
>because it doesn't have as much overhead. "Bogoutil -l" is also quick
>(with a sorted input file, like that produced by dump). It sounds like
>you have a database problem. Have you run db_verify, or other utility, to
>check the database integrity?
Um, no.
[njs at scifi .bogofilter]$ /var/spool/news/bogofilter/db-4.1.24/build_unix/db_verify goodlist.db
db_verify: Page 190: out-of-order key at entry 79
db_verify: Page 245: out-of-order key at entry 103
db_verify: Page 1104: out-of-order key at entry 138
db_verify: Page 1794: out-of-order key at entry 28
db_verify: Page 1794: out-of-order key at entry 78
db_verify: Page 2123: out-of-order key at entry 16
db_verify: Page 2701: out-of-order key at entry 4
db_verify: Page 2701: out-of-order key at entry 118
db_verify: Page 2843: out-of-order key at entry 8
db_verify: DB->verify: goodlist.db: DB_VERIFY_BAD: Database verification failed
[njs at scifi .bogofilter]$ /var/spool/news/bogofilter/db-4.1.24/build_unix/db_verify spamlist.db
db_verify: Page 198: out-of-order key at entry 112
db_verify: Page 199: out-of-order key at entry 174
db_verify: DB->verify: spamlist.db: DB_VERIFY_BAD: Database verification failed
[njs at scifi .bogofilter]$
This does not look good.
With my spamlist.db, I was able to:
bogoutil -d spamlist.db | sort -u | bogoutil -l spamlist.new.db
and then the new version verified.
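
For illustration only, here is a small sketch of what the sort -u stage contributes to that pipeline: the loader gets sorted input with exact duplicate lines removed. The sample records are invented, the function names are hypothetical, and the "token count" line layout is an assumption about the dump text, not something checked against bogoutil.

```shell
#!/bin/sh
# Hypothetical dump text: one record per line, out of order, with one
# record repeated the way a looping dump might repeat it.
sample_dump() {
    printf '%s\n' 'viagra 12' 'hello 3' 'viagra 12' 'account 7'
}

# Same filter as in the pipeline above: sort, dropping exact duplicate lines.
clean_dump() {
    sort -u
}

sample_dump | clean_dump
```

Note that sort -u only collapses lines that are identical in full. Two records with the same token but different counts would both survive, so this is a cleanup for repeated output, not a merge.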
I tried
/var/spool/news/bogofilter/db-4.1.24/build_unix/db_dump -p -f goodlist.out goodlist.db
After it had accumulated about 30 meg of output, I ran this command to
count how often one particular key showed up:
[root at scifi bogofilter-0.9.1.2]# grep process-accounting-4.html ~njs/.bogofilter/goodlist.out | wc -l
151
[root at scifi bogofilter-0.9.1.2]#
Displaying the matching lines shows that the entire key is repeated. There
is a loop: the key and its data are dumped over and over.
Adding -r to the dump,
/var/spool/news/bogofilter/db-4.1.24/build_unix/db_dump -r -p -f goodlist.out goodlist.db
gives me about 10 meg of output. That particular key appears only once.
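
The single-key grep check above can be widened to every key at once. The following is a minimal sketch, assuming a line-oriented text dump with one record per line and the key as the first whitespace-separated field (an assumption about the layout, not a documented format). The function name repeated_keys and the sample data are made up.

```shell
#!/bin/sh
# List keys that occur more than once in a line-oriented dump read from
# standard input, printing "key count" for each repeated key.
# Assumes the key is the first whitespace-separated field on every line.
repeated_keys() {
    awk '{ print $1 }' | sort | uniq -c | awk '$1 > 1 { print $2, $1 }'
}

# Invented sample in place of real dump output.
printf '%s\n' 'alpha 1' 'beta 2' 'alpha 1' 'alpha 1' | repeated_keys
```

An empty result would suggest the dump is not looping, at least under the layout assumed here.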
I ran this:
[njs at scifi .bogofilter]$ /var/spool/news/bogofilter/db-4.1.24/build_unix/db_load -t btree -f goodlist.out goodlist.new.db
and then finally this:
bogoutil -d goodlist.new.db | bogoutil -l goodlist.new2.db
File sizes, just for grins. The database produced by bogoutil is about
2/3rds the size of the one made by db_load.
[njs at scifi .bogofilter]$ du -h -a
2.6K ./goodlist
1.3K ./badlist
380K ./spamlist.db
173K ./spamlist.serial
173K ./spamlist.sorted
14M ./goodlist.db
1.4M ./spamlist.bad.db
14M ./goodlist.backup.db
9.7M ./goodlist.out
16M ./goodlist.new.db
10M ./goodlist.new2.db
66M .
[njs at scifi .bogofilter]$ ls -l
total 67866
-rw-rw-rw- 1 njs wheel 1305 Dec 4 03:21 badlist
-rw-rw-rw- 1 njs wheel 2671 Dec 4 03:21 goodlist
-rw-r--r-- 1 njs wheel 14696448 Feb 8 02:22 goodlist.backup.db
-rw-r--r-- 1 njs wheel 14696448 Feb 8 02:33 goodlist.db
-rw-r--r-- 1 njs wheel 16650240 Feb 8 02:55 goodlist.new.db
-rw-r--r-- 1 njs wheel 10776576 Feb 8 03:05 goodlist.new2.db
-rw-r--r-- 1 njs wheel 10215565 Feb 8 02:49 goodlist.out
-rw-r--r-- 1 njs wheel 1425408 Feb 8 02:13 spamlist.bad.db
-rw-r--r-- 1 njs wheel 389120 Feb 8 02:16 spamlist.db
-rw-r--r-- 1 njs wheel 177325 Feb 8 02:11 spamlist.serial
-rw-r--r-- 1 njs wheel 177325 Feb 8 02:12 spamlist.sorted
[njs at scifi .bogofilter]$
So, yes, something, perhaps filling the filesystem, caused all sorts of
problems. Standard dumps, whether the one that came with db or yours, were
worthless because of the looping, and I honestly do not know whether there
is any way of dealing with the issue properly. The recovery dump (db_dump -r)
worked, and I was able to rebuild something that looked good and get going
again.
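
As a recap, the sequence that worked for goodlist.db can be written down as a small script. This is only a sketch: it prints the commands instead of running them, so they can be reviewed first. The DB_BIN variable and the rebuild_plan function are hypothetical names, the file suffixes follow the ones used above, and the final db_verify on the rebuilt file is an added sanity check along the lines of what was done for spamlist.db.

```shell
#!/bin/sh
# Directory holding the Berkeley DB utilities; override DB_BIN as needed.
DB_BIN=${DB_BIN:-/var/spool/news/bogofilter/db-4.1.24/build_unix}

# Print, one per line, the commands used to salvage and rebuild "$1".db.
rebuild_plan() {
    name=$1
    echo "$DB_BIN/db_verify $name.db"                           # confirm the damage
    echo "$DB_BIN/db_dump -r -p -f $name.out $name.db"          # recovery dump
    echo "$DB_BIN/db_load -t btree -f $name.out $name.new.db"   # reload as btree
    echo "bogoutil -d $name.new.db | bogoutil -l $name.new2.db" # compact via bogoutil
    echo "$DB_BIN/db_verify $name.new2.db"                      # check the result
}

rebuild_plan goodlist
```

Work on a copy of the database, and make sure there is enough free space for the dump before starting.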
Just for grins, I tried a -R recovery. That produced about half again as
much data output. I decided not to use that since I figured that it was,
as the man page suggested, mostly garbage.
Thanks for the db_verify suggestion, that got me pointed down the "road to
recovery".
By the way, as far as comment syntax goes: Piotr KUCHARSKI ran the page
through Opera 6 and sent me the output. It gets all cases correct except
for 8, the nested comment case, strengthening the case against closing
comments on naked >. I noted that there was another text-based browser
included in Linux when Red Hat released a fix for it: w3m, which is
supposed to "do the right thing" when displaying HTML and plain text. It
works properly with cases 1-7, and does what Opera does with case 8: it
closes on the first -->, not properly dealing with nested comments.
--
SPAM: Trademark for spiced, chopped ham manufactured by Hormel.
spam: Unsolicited, Bulk E-mail, where e-mail can be interpreted generally
to mean electronic messages designed to be read by an individual, and it
can include Usenet, SMS, AIM, etc. But if it is not all three of
Unsolicited, Bulk, and E-mail, it simply is not spam. Misusing the term
plays into the hands of the spammers, since it causes confusion, and
spammers thrive on confusion. Spam is not speech, it is an action, like
theft, or vandalism. If you were not confused, would you patronize a spammer?
Nick Simicich - njs at scifi.squawk.com - http://scifi.squawk.com/njs.html
Stop by and light up the world!