cdb support: design question

Greg Louis glouis at dynamicro.on.ca
Thu Jul 10 15:12:58 CEST 2003


It appears that Michael Tokarev's public-domain implementation of
the cdb spec works ok, except if the cdb tool is invoked with -u while
creating; in that case it aborts with a bogus error message partway
through the build.  (This is version 0.72, which seems to be the latest
release.)

If I build a wordlist.raw with Matthias's dbdtocdb.pl script, and
then do cdb -c cdb.cdb wordlist.raw, a working cdb database results. 
The value field consists of two 32-bit integers (the spam and nonspam
counts -- I'm working with the one-list patch and no timestamps).

It's also possible to do a straight

bogoutil -d wordlist.db | cdb -cm wordlist.cdb

but in this case the value field consists of a string with the spam and
nonspam counts in ASCII, separated by a space.
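To make the two value layouts concrete, here is a minimal sketch in C of how one record's value might be built in each form. The function names are hypothetical, and host byte order for the integer form is an assumption (nothing above fixes a byte order).

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Integer form: two 32-bit counts stored back to back, always 8 bytes.
 * ASSUMPTION: host byte order; the byte order is not specified above. */
static size_t pack_counts_binary(uint32_t spam, uint32_t nonspam,
                                 unsigned char out[8])
{
    memcpy(out, &spam, sizeof spam);
    memcpy(out + sizeof spam, &nonspam, sizeof nonspam);
    return sizeof spam + sizeof nonspam;
}

/* ASCII form: decimal counts separated by one space.
 * Returns the number of value bytes (excluding the terminating NUL that
 * snprintf writes), or 0 if the buffer is too small. */
static size_t pack_counts_ascii(uint32_t spam, uint32_t nonspam,
                                char *out, size_t outsz)
{
    int n = snprintf(out, outsz, "%lu %lu",
                     (unsigned long)spam, (unsigned long)nonspam);
    if (n < 0 || (size_t)n >= outsz)
        return 0;
    return (size_t)n;
}
```

With counts of 12 and 3, the integer form takes 8 bytes and the ASCII form takes 4 ("12 3"); the ASCII form only becomes longer than 8 bytes once the counts have several digits each.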

In terms of CPU overhead at read time, the integer form is preferable
(it saves a cdb_datalen() call and a sscanf() call for each lookup).
The ASCII form, in my case, turns out to be just over 11% more compact
than the integer form, but that's not a very important difference (much
more important is that my 30 MB wordlist.db yields a 4 MB .cdb file even
in the integer form).  The major advantage of the ASCII form is that the
cdb tool produces a human-readable dump -- no need for special bogoutil
code.  Reading the .cdb file with a hex editor is easier too.
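The read-time difference can be sketched as follows. This is an illustration only, not bogofilter code: the function names are hypothetical, the value bytes are assumed to have been fetched from the database already, and the len argument stands in for whatever cdb_datalen() would report. Host byte order is again an assumption for the integer form.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Longest possible ASCII value: "4294967295 4294967295" */
#define MAX_ASCII_VALUE 21

/* Integer form: the length is fixed, so decoding is two copies. */
static int unpack_counts_binary(const unsigned char *val, size_t len,
                                uint32_t *spam, uint32_t *nonspam)
{
    if (len != 8)
        return -1;
    memcpy(spam, val, 4);
    memcpy(nonspam, val + 4, 4);
    return 0;
}

/* ASCII form: the length varies per record, and the stored value is not
 * NUL-terminated, so it is copied and terminated before sscanf() runs. */
static int unpack_counts_ascii(const char *val, size_t len,
                               uint32_t *spam, uint32_t *nonspam)
{
    char tmp[MAX_ASCII_VALUE + 1];
    unsigned long s, n;

    if (len == 0 || len > MAX_ASCII_VALUE)
        return -1;
    memcpy(tmp, val, len);
    tmp[len] = '\0';
    if (sscanf(tmp, "%lu %lu", &s, &n) != 2)
        return -1;
    *spam = (uint32_t)s;
    *nonspam = (uint32_t)n;
    return 0;
}
```

The extra work in the ASCII path is the length lookup, the copy and the sscanf() parse; whether that matters in practice would need measuring against real lookups.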

I'm inclined, therefore, to pay the sscanf price and go with the ASCII
version, but am open to arguments to the contrary if anybody feels
strongly about it.

Opinions, anyone?
-- 
| G r e g  L o u i s          | gpg public key: finger     |
|   http://www.bgl.nu/~glouis |   glouis at consultronics.com |
| http://wecanstopspam.org in signatures fights junk email |



