[PATCH] combined wordlist a.k.a. single list

Greg Louis glouis at dynamicro.on.ca
Mon Jun 2 20:22:29 CEST 2003


On 20030602 (Mon) at 1046:22 -0700, Malcolm Dew-Jones wrote:
> 
> 
> 
> On Mon, 2 Jun 2003, Greg Louis wrote:
> 
> > On 20030601 (Sun) at 2131:55 -0700, Malcolm Dew-Jones wrote:
> > > 
> > > 
> > > My question/observation is this. 
> > > 
> > > 
> > > The size of the files in the new versions is my biggest concern. 
> > 
> > Why?  You need less disk space with one wordlist than with two.  
> 
> Maybe, maybe not.
> 
> A simple example
> 
> 	list 1
> 		word-1 counta
> 		word-2 counta
> 
> 	list 2
> 		word-3 countb
> 		word-4 countb
> 
> 	combined list
> 		word-1 counta countb
> 		word-2 counta countb
> 		word-3 counta countb 
> 		word-4 counta countb
> 
> My trivial example looks like it would be larger when combined.  It
> depends on the overlap of words in the lists.  Because my real lists are
> not the same size, I know that at least one list must have a bunch of
> words that aren't in both files. 

You haven't considered the indexing overhead involved in keeping two
separate databases, but I understand the point.
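The overlap dependence is easy to sketch.  Here is a back-of-envelope
comparison; the byte figures (KEY, COUNT, OVERHEAD) are made-up
assumptions for illustration, not bogofilter's actual record layout:

```python
# Rough size comparison of two separate wordlists vs. one combined
# list.  All byte sizes below are assumptions for illustration only.

KEY = 12        # assumed average token length in bytes
COUNT = 4       # assumed bytes per stored count
OVERHEAD = 8    # assumed per-record index/padding overhead

def separate(n_good, n_spam):
    """Two databases, one count field per record."""
    rec = KEY + COUNT + OVERHEAD
    return (n_good + n_spam) * rec

def combined(n_good, n_spam, overlap):
    """One database, two count fields per record; `overlap` is the
    number of tokens present in both lists (stored once, not twice)."""
    rec = KEY + 2 * COUNT + OVERHEAD
    return (n_good + n_spam - overlap) * rec

for overlap in (0, 250_000, 500_000):
    s = separate(1_000_000, 500_000)
    c = combined(1_000_000, 500_000, overlap)
    print(f"overlap {overlap:>7}: separate {s:>10} B, combined {c:>10} B")
```

With zero overlap the combined list is larger (each record carries an
extra count field); past a modest overlap it wins, because shared
tokens are stored once instead of twice.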

> However, other feedback, including your own below, suggests this is not
> a problem in practice, so fine.
> 
> 
> > either; disk space is cheap these days, and the overhead of decoding
> > might offset any lookup time saved by having slightly smaller records. 
> 
> I am not worried about speed, only space.  With bogofilter.11.2, our
> goodlist.db file is more than 600 Megs in size and has about 12 million
> entries.  The files just become less practical to manipulate as they get
> larger.

Hm.  Your goodlist packs 21 times the tokens I've got into 55 times the
space.  It might take a while, but I would like to suggest that you try

for l in spam good; do
    # dump in key order, then bulk-load the pre-sorted records
    bogoutil -d ${l}list.db | bogoutil -l ${l}list.new
    # sanity-check the rebuilt database
    db_verify ${l}list.new
done

and see whether the .new files are a whole lot smaller than the
originals.  They might well be just over half the size, and work as
well as the big ones.  That's been my experience: loading a Berkeley DB
in b-tree mode with pre-sorted records saves almost half the space
needed to load it with records at random.
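The effect can be sketched with a toy model of b-tree leaf pages.  The
page capacity and split rules below are assumptions mimicking the usual
"rightmost split" optimization for sequential inserts, not Berkeley DB's
actual code:

```python
import bisect
import random

PAGE_CAP = 64  # hypothetical keys per leaf page

def load(keys):
    """Toy model of b-tree leaf pages as sorted key lists, with the
    common 'rightmost split' optimization: a key belonging past the
    end of a full last page opens a fresh page instead of splitting
    the full one in half."""
    pages = []
    for k in keys:
        if not pages:
            pages.append([k])
            continue
        firsts = [p[0] for p in pages]
        i = max(bisect.bisect_right(firsts, k) - 1, 0)
        page = pages[i]
        if len(page) < PAGE_CAP:
            bisect.insort(page, k)
        elif i == len(pages) - 1 and k > page[-1]:
            pages.append([k])       # append split: old page stays full
        else:
            mid = PAGE_CAP // 2     # ordinary split: two half-full pages
            right = page[mid:]
            del page[mid:]
            pages.insert(i + 1, right)
            bisect.insort(page if k <= page[-1] else right, k)
    return pages

def fill(pages):
    """Fraction of total leaf capacity actually occupied."""
    return sum(len(p) for p in pages) / (len(pages) * PAGE_CAP)

keys = list(range(10_000))
sorted_fill = fill(load(keys))      # pre-sorted load, as in dump | load
random.seed(1)
shuffled = keys[:]
random.shuffle(shuffled)
random_fill = fill(load(shuffled))  # random-order load
print(f"sorted: {sorted_fill:.0%} full, random: {random_fill:.0%} full")
```

Sorted input leaves every page but the last one full, while random
input keeps splitting pages in half, so a large fraction of each page
sits empty; that unused fraction is roughly the space the dump-and-reload
recovers.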

-- 
| G r e g  L o u i s          | gpg public key: finger     |
|   http://www.bgl.nu/~glouis |   glouis at consultronics.com |
| http://wecanstopspam.org in signatures fights junk email |




More information about the bogofilter-dev mailing list