dump/load and db size (Re: What did I do wrong?)

Pavel Kankovsky peak at argo.troja.mff.cuni.cz
Thu Feb 19 18:02:10 CET 2004


On Thu, 19 Feb 2004 tallison at tacocat.net wrote:

> Why would a dump/load, without doing anything else to the data, make the
> database smaller?

The data in the database are organized as a B-tree. The nodes of the
B-tree are grouped into pages. When a new node is inserted into the tree
and the page it should be added to is full (or does not have enough free
space), the page is split into two pages. This means that as much as one
half of the space can be empty after an arbitrary sequence of inserts.
This strategy is good at minimizing the number of disk operations during
a random mix of inserts, deletes, and fetches, but it is far from optimal
in other situations, e.g. many fetches (Bogofilter testing mails)
interrupted by short bursts of random inserts (Bogofilter registering new
spam/ham).

Dump/load rebuilds the tree from scratch and reduces the amount of
unused space.
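
To make the effect concrete, here is a rough Python sketch of the idea.
It is a toy model, not Berkeley DB's code: it tracks leaf pages only,
counts keys instead of bytes, and PAGE_CAPACITY, insert(), fill_factor()
and rebuild() are names made up for the illustration. rebuild() assumes
that loading keys in sorted order lets the pages be packed full, which is
a simplification of what a real load does.

```python
import bisect
import random

PAGE_CAPACITY = 32  # hypothetical number of keys that fit on one leaf page


def insert(pages, key):
    """Insert key into a sorted list of sorted leaf pages.

    A full page is split into two half-full pages before the insert.
    """
    # Page i (for i >= 1) holds keys >= its own first key, so the target
    # page index is the number of such first keys that are <= key.
    firsts = [p[0] for p in pages[1:]]
    idx = bisect.bisect_right(firsts, key)
    page = pages[idx]
    if len(page) >= PAGE_CAPACITY:
        mid = len(page) // 2
        left, right = page[:mid], page[mid:]
        pages[idx:idx + 1] = [left, right]
        page = right if key >= right[0] else left
    bisect.insort(page, key)


def fill_factor(pages):
    """Fraction of the available key slots that are actually used."""
    used = sum(len(p) for p in pages)
    return used / (len(pages) * PAGE_CAPACITY)


def rebuild(pages):
    """Toy dump/load: read all keys in order and pack them into new pages."""
    keys = [k for p in pages for k in p]
    return [keys[i:i + PAGE_CAPACITY]
            for i in range(0, len(keys), PAGE_CAPACITY)]


random.seed(2004)
pages = [[]]
for key in random.sample(range(1_000_000), 10_000):
    insert(pages, key)

before = fill_factor(pages)
packed = rebuild(pages)
after = fill_factor(packed)
print(f"random inserts: {len(pages)} pages, fill factor {before:.0%}")
print(f"after rebuild:  {len(packed)} pages, fill factor {after:.0%}")
```

In this model the randomly filled tree ends up with a fill factor
somewhere between 50% and 100%, while the rebuilt one is almost
completely full and therefore needs fewer pages for the same keys.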

The actual amount of unused space can be determined using db_stat, e.g.

$ db_stat -d wordlist.db 
53162	Btree magic number.
8	Btree version number.
Flags:
2	Minimum keys per-page.
4096	Underlying database page size.
3	Number of levels in the tree.
128114	Number of unique keys in the tree.
128114	Number of data items in the tree.
7	Number of tree internal pages.
5026	Number of bytes free in tree internal pages (82% ff).
1088	Number of tree leaf pages.
59076	Number of bytes free in tree leaf pages (99% ff).
0	Number of tree duplicate pages.
0	Number of bytes free in tree duplicate pages (0% ff).
0	Number of tree overflow pages.
0	Number of bytes free in tree overflow pages (0% ff).
0	Number of pages on the free list.

ff stands for "fill-factor", i.e. the fraction of space that is used for
data or B-tree data structures.
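
The percentages can be checked by hand from the other numbers in the
output. The snippet below is only a sketch of that arithmetic in Python;
fill_factor() is a helper invented here, and it assumes the fill factor
is simply (total bytes - free bytes) / total bytes for the given kind of
page, which matches the figures above after rounding.

```python
def fill_factor(num_pages, page_size, bytes_free):
    """Fraction of the space in num_pages pages that is not free."""
    total = num_pages * page_size
    if total == 0:
        return 0.0  # no pages of this kind, as in the "0% ff" lines
    return 1.0 - bytes_free / total


PAGE_SIZE = 4096  # "Underlying database page size"

leaf = fill_factor(1088, PAGE_SIZE, 59076)    # tree leaf pages
internal = fill_factor(7, PAGE_SIZE, 5026)    # tree internal pages
print(f"leaf {leaf:.1%}, internal {internal:.1%}")
# prints: leaf 98.7%, internal 82.5%
```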

(Yes, this db is quite small.)

--Pavel Kankovsky aka Peak  [ Boycott Microsoft--http://www.vcnet.com/bms ]
"Resistance is futile. Open your source code and prepare for assimilation."






