[PATCH] combined wordlist a.k.a. single list

David Relson relson at osagesoftware.com
Mon Jun 2 22:06:09 CEST 2003


At 01:46 PM 6/2/03, Malcolm Dew-Jones wrote:



>On Mon, 2 Jun 2003, Greg Louis wrote:
>
> > On 20030601 (Sun) at 2131:55 -0700, Malcolm Dew-Jones wrote:
> > >
> > >
> > > My question/observation is this.
> > >
> > >
> > > The size of the files in the new versions is my biggest concern.
> >
> > Why?  You need less disk space with one wordlist than with two.
>
>Maybe, maybe not.
>
>A simple example
>
>         list 1
>                 word-1 counta
>                 word-2 counta
>
>         list 2
>                 word-3 countb
>                 word-4 countb
>
>         combined list
>                 word-1 counta countb
>                 word-2 counta countb
>                 word-3 counta countb
>                 word-4 counta countb
>
>My trivial example looks like it would be larger when combined.  It
>depends on the overlap of words in the lists.  Because my real lists are
>not the same size I know that at least one list must have a bunch of words
>that aren't in both files.
>
>However, other feedback including your own below suggests this is not a
>problem in practise, so fine.
>
>
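
The size question in the example above comes down to how many distinct
words the two lists hold between them.  A rough sketch, using two
made-up one-word-per-line files (list1.txt and list2.txt are
hypothetical names, not bogofilter files), counts the records each
layout would need.  It ignores whatever per-record overhead the
database format adds.

```shell
# Hypothetical input: one word per line, mirroring the four-word example.
printf 'word-1\nword-2\n' > list1.txt
printf 'word-3\nword-4\n' > list2.txt

# $(( ... )) strips any padding that wc prints around the number.
n1=$(($(wc -l < list1.txt)))
n2=$(($(wc -l < list2.txt)))

# The combined list needs one record per distinct word across both lists.
combined=$(($(sort -u list1.txt list2.txt | wc -l)))

echo "separate: $((n1 + n2)) records, one count each"
echo "combined: $combined records, two counts each"
```

With no overlap, as here, the combined list has as many records as the
two lists together and each record carries an extra count.  With full
overlap it would have half as many records, each word stored once
instead of twice.
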
> > either; disk space is cheap these days, and the overhead of decoding
> > might offset any lookup time saved by having slightly smaller records.
>
>I am not worried about speed, only space.  With bogofilter.11.2, our
>goodlist.db file is more than 600 Megs in size and has about 12 million
>entries.  The files just become less practical to manipulate as they get
>larger.

Malcolm,

You're reporting wordlists an order of magnitude larger than any others of 
which I've heard.  I'd say your concerns about file size are valid :-)

Question:  have you tried using bogoutil's dump/load abilities to generate 
new wordlists?  In some cases the database shrinks significantly when you 
run "bogoutil -d old | bogoutil -l new".  Evidently space utilization 
degrades as a database grows over time.  Writing the data out in order 
(as bogoutil -d does) and reading it into a new database can have a 
dramatic effect.  Today I reduced a 38M combined wordlist to 22M by doing 
just that.
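
Wrapped up as a small shell function, the dump/load step might look
like the sketch below.  The function name compact_wordlist and the file
names are placeholders of mine; only the "bogoutil -d" and "bogoutil -l"
calls come from the command above.

```shell
# Rebuild a wordlist by dumping the old file and loading it into a new one.
# compact_wordlist is a hypothetical helper name, not part of bogofilter.
compact_wordlist() {
    old=$1
    new=$2

    # Refuse to load into a file that already exists.
    if [ -e "$new" ]; then
        echo "refusing to overwrite $new" >&2
        return 1
    fi

    # Dump the old database as text and load it into the new one.
    # Only a failure of the load side is detected here.
    bogoutil -d "$old" | bogoutil -l "$new" || return 1

    # Show both sizes so the saving (if any) is visible.
    ls -l "$old" "$new"
}
```

Usage would be something like "compact_wordlist wordlist.db
wordlist.new.db", followed by swapping the new file in once you have
checked it.  Nothing should be writing to the old wordlist while the
dump runs.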

David




