cdb support: design question

Fri Jul 11 02:39:31 CEST 2003

Greg Louis <glouis at dynamicro.on.ca> writes:

> If I build a wordlist.raw with Matthias's dbdtocdb.pl script, and
> then do cdb -c cdb.cdb wordlist.raw, a working cdb database results. 
> The value field consists of two 32-bit integers (the spam and nonspam
> counts -- I'm working with the one-list patch and no timestamps).

The value is copied verbatim from the .DB.
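
For illustration, a minimal sketch of reading such a value back. It
assumes the value is exactly two unsigned 32-bit integers, spam count
first, in the byte order of the machine that wrote the file; the helper
names are hypothetical and not part of bogofilter.

```python
import struct

# Assumed layout: two unsigned 32-bit integers, spam count first,
# in native byte order with standard sizes (8 bytes in total).
COUNTS = struct.Struct("=II")

def pack_counts(spam, good):
    """Encode the two counts as an 8-byte binary value."""
    return COUNTS.pack(spam, good)

def unpack_counts(value):
    """Decode an 8-byte binary value back into (spam, good)."""
    return COUNTS.unpack(value)
```

Decoding is a fixed-size copy with no text parsing, which is where the
binary form gets its speed.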

> It's also possible to do a straight
>
> bogoutil -d wordlist.db | cdb -cm wordlist.cdb
>
> but in this case the value field consists of a string with the spam and
> nonspam counts in ASCII, separated by a space.
>
> the integer form).  The major advantage of the ASCII form is that the
> cdb tool produces a human-readable dump -- no need for special bogoutil
> code.  Reading the .cdb file with a hex editor is easier too.

The current .db files aren't readable when dumped with db_dump -p alone
either. You'll see the tokens, followed by junk from the count fields.
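
A rough sketch of why the count fields come out as junk. It only
imitates a printable-style dump (printable bytes shown as-is, everything
else as a backslash and two hex digits); it is not db_dump's own code,
and the little-endian layout is an assumption made so the output is the
same on every machine.

```python
import struct

def printable_dump(value):
    """Render bytes roughly the way a printable-mode dump would."""
    out = []
    for b in value:
        if b == 0x5C:                 # the backslash itself gets escaped
            out.append("\\\\")
        elif 0x20 <= b < 0x7F:        # printable ASCII is shown as-is
            out.append(chr(b))
        else:                         # anything else becomes \xx in hex
            out.append("\\%02x" % b)
    return "".join(out)

binary_value = struct.pack("<II", 12, 345)   # assumed little-endian counts
ascii_value = b"12 345"                      # the ASCII alternative

print(printable_dump(binary_value))   # \0c\00\00\00Y\01\00\00
print(printable_dump(ascii_value))    # 12 345
```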

> I'm inclined, therefore, to pay the sscanf price and go with the ASCII
> version, but am open to arguments to the contrary if anybody feels
> strongly about it.

If CDB is about speed, then we shouldn't throw away the gains by storing
the counts as decimal numbers in plain ASCII text, of all formats. We'll
have to provide tools to generate the input files anyhow in order to
implement "training".
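
Such a tool could be quite small. A sketch, assuming the standard
cdbmake input format ("+klen,dlen:key->data" records, with an empty line
ending the input) and the same assumed two-integer value layout as
above; write_wordlist and its arguments are hypothetical names.

```python
import io
import struct

def cdbmake_record(key, value):
    """Format one record in cdbmake input syntax: +klen,dlen:key->data"""
    return b"+%d,%d:%s->%s\n" % (len(key), len(value), key, value)

def write_wordlist(tokens, out):
    """Write {token: (spam, good)} as cdbmake input with binary counts."""
    for token, (spam, good) in tokens.items():
        # Assumed value layout: two native-order unsigned 32-bit integers.
        value = struct.pack("=II", spam, good)
        out.write(cdbmake_record(token.encode("utf-8"), value))
    out.write(b"\n")   # an empty line terminates cdbmake input

# Example: build the input in memory; in practice it would be piped
# or written to a file for the cdb tool to read.
buf = io.BytesIO()
write_wordlist({"viagra": (40, 1), "meeting": (2, 57)}, buf)
```

Because the records are length-prefixed, the binary counts can contain
any byte values, including newlines and NULs.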

> Opinions, anyone?

-- 
Matthias Andree