[PATCH] combined wordlist a.k.a. single list

Malcolm Dew-Jones yf110 at victoria.tc.ca
Mon Jun 2 19:46:22 CEST 2003




On Mon, 2 Jun 2003, Greg Louis wrote:

> On 20030601 (Sun) at 2131:55 -0700, Malcolm Dew-Jones wrote:
> > 
> > 
> > My question/observation is this. 
> > 
> > 
> > The size of the files in the new versions is my biggest concern. 
> 
> Why?  You need less disk space with one wordlist than with two.  

Maybe, maybe not.

A simple example:

	list 1
		word-1 counta
		word-2 counta

	list 2
		word-3 countb
		word-4 countb

	combined list
		word-1 counta countb
		word-2 counta countb
		word-3 counta countb 
		word-4 counta countb

My trivial example looks like it would be larger when combined, because
every record then carries both counts.  It depends on how much the words in
the two lists overlap.  Because my real lists are not the same size, I know
that at least one list must have a bunch of words that aren't in the other.
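
To make the overlap argument concrete, here is a rough sketch in Python.  It
assumes a made-up record layout (the word plus a fixed 4-byte field per
count); COUNT_BYTES and both function names are hypothetical and say nothing
about bogofilter's real on-disk format or the database's own overhead.

```python
# Assumed record layout (illustrative only, not bogofilter's real format):
# the word itself plus a fixed-size field for each count stored with it.
COUNT_BYTES = 4  # hypothetical size of one count field


def separate_size(words_a, words_b):
    """Total bytes for two lists, one count per record."""
    return sum(len(w) + COUNT_BYTES for w in words_a) + sum(
        len(w) + COUNT_BYTES for w in words_b
    )


def combined_size(words_a, words_b):
    """Total bytes for one list over the union of words, two counts per record."""
    return sum(len(w) + 2 * COUNT_BYTES for w in set(words_a) | set(words_b))


# No overlap, as in the example above: the combined list is larger.
no_overlap = (["word-1", "word-2"], ["word-3", "word-4"])
# Full overlap: the combined list is smaller.
full_overlap = (["word-1", "word-2"], ["word-1", "word-2"])

for name, (a, b) in [("no overlap", no_overlap), ("full overlap", full_overlap)]:
    print(name, separate_size(a, b), combined_size(a, b))
```

Under these assumptions the disjoint lists go from 40 bytes to 56 when
combined, while fully overlapping lists go from 40 bytes to 28, so the
answer really does hinge on the overlap.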

However, other feedback, including your own below, suggests this is not a
problem in practice, so fine.


> either; disk space is cheap these days, and the overhead of decoding
> might offset any lookup time saved by having slightly smaller records. 

I am not worried about speed, only space.  With bogofilter.11.2, our
goodlist.db file is more than 600 Megs in size and has about 12 million
entries.  The files just become less practical to manipulate as they get
larger.
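
As a rough sanity check on those figures, the average cost per entry can be
worked out as below.  The sketch assumes 1 Meg = 10**6 bytes and exactly 12
million entries, both of which are roundings of the numbers above, and the
function name is just for illustration.

```python
def average_bytes_per_entry(file_bytes, entries):
    """Average file bytes per entry, database overhead included."""
    return file_bytes / entries


# Assumed roundings: 600 Megs taken as 600 * 10**6 bytes, 12 million entries.
print(average_bytes_per_entry(600 * 10**6, 12 * 10**6))
```

That comes to about 50 bytes per entry, which includes whatever overhead the
database file itself adds, not just the word and its count.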






More information about the bogofilter-dev mailing list