[PATCH] combined wordlist a.k.a. single list
David Relson
relson at osagesoftware.com
Mon Jun 2 22:06:09 CEST 2003
At 01:46 PM 6/2/03, Malcolm Dew-Jones wrote:
>On Mon, 2 Jun 2003, Greg Louis wrote:
>
> > On 20030601 (Sun) at 2131:55 -0700, Malcolm Dew-Jones wrote:
> > >
> > >
> > > My question/observation is this.
> > >
> > >
> > > The size of the files in the new versions is my biggest concern.
> >
> > Why? You need less disk space with one wordlist than with two.
>
>Maybe, maybe not.
>
>A simple example
>
> list 1
> word-1 counta
> word-2 counta
>
> list 2
> word-3 countb
> word-4 countb
>
> combined list
> word-1 counta countb
> word-2 counta countb
> word-3 counta countb
> word-4 counta countb
>
>My trivial example looks like it would be larger when combined. It
>depends on the overlap of words in the lists. Because my real lists are
>not the same size I know that at least one list must have a bunch of words
>that aren't in both files.
>
>However, other feedback including your own below suggests this is not a
>problem in practise, so fine.
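The trade-off in the example above can be put in rough numbers. What follows is only a back-of-the-envelope sketch, not how bogofilter lays out its records: it assumes a fixed average key size (`key_bytes`) and a fixed count size (`count_bytes`), both made-up parameters, and it ignores all database overhead. The function names are hypothetical as well.

```python
def separate_size(good, spam, key_bytes=8, count_bytes=4):
    """Bytes for two lists: every record is one key plus one count.

    key_bytes and count_bytes are illustrative assumptions, not
    bogofilter's real record layout.
    """
    return (len(good) + len(spam)) * (key_bytes + count_bytes)


def combined_size(good, spam, key_bytes=8, count_bytes=4):
    """Bytes for one list: every distinct word is one key plus two counts."""
    return len(good | spam) * (key_bytes + 2 * count_bytes)


# No overlap, as in the trivial example: the combined list is larger.
good = {"word-1", "word-2"}
spam = {"word-3", "word-4"}
print(separate_size(good, spam), combined_size(good, spam))  # 48 64

# Two of four words shared: the combined list is now smaller.
good = {"word-1", "word-2", "word-3"}
spam = {"word-2", "word-3", "word-4"}
print(separate_size(good, spam), combined_size(good, spam))  # 72 64
```

Under these assumptions the combined list wins as soon as the key bytes saved on shared words outweigh the extra count carried by the unshared ones, so the outcome depends on how much the two lists overlap.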
>
>
> > either; disk space is cheap these days, and the overhead of decoding
> > might offset any lookup time saved by having slightly smaller records.
>
>I am not worried about speed, only space. With bogofilter.11.2, our
>goodlist.db file is more than 600 Megs in size and has about 12 million
>entries. The files just become less practical to manipulate as they get
>larger.
Malcolm,

You're reporting wordlists an order of magnitude larger than any others of
which I've heard. I'd say your concerns about file size are valid :-)

Question: have you tried using bogoutil's dump/load abilities to generate
new wordlists? It appears that database size drops significantly in some
cases when you run "bogoutil -d old | bogoutil -l new". Evidently as
databases grow over time, space utilization is poor. Writing the data out
in order (as bogoutil -d does) and reading it into a new database can have
a dramatic effect. Today I reduced a 38M combined wordlist to 22M by doing
just that.

David
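For anyone who wants to script the dump/load step and see how much space it recovers, here is a minimal sketch. It assumes `bogoutil` is on the PATH and only wraps the same "bogoutil -d old | bogoutil -l new" pipeline quoted above; the function names `compact_wordlist` and `percent_saved` are hypothetical helpers, not part of bogofilter.

```python
import os
import subprocess


def percent_saved(old_size, new_size):
    """Percentage by which the rebuilt file is smaller, to one decimal."""
    return round(100.0 * (old_size - new_size) / old_size, 1)


def compact_wordlist(old_path, new_path):
    """Run `bogoutil -d old_path | bogoutil -l new_path`.

    Returns (old_size, new_size) in bytes. Assumes bogoutil is installed
    and that new_path does not already hold data you want to keep.
    """
    dump = subprocess.Popen(["bogoutil", "-d", old_path],
                            stdout=subprocess.PIPE)
    load = subprocess.Popen(["bogoutil", "-l", new_path],
                            stdin=dump.stdout)
    # Close our copy of the pipe so the dump side sees a broken pipe
    # if the load side exits early.
    dump.stdout.close()
    load_rc = load.wait()
    dump_rc = dump.wait()
    if dump_rc != 0 or load_rc != 0:
        raise RuntimeError("bogoutil failed: dump=%d load=%d"
                           % (dump_rc, load_rc))
    return os.path.getsize(old_path), os.path.getsize(new_path)


# The 38M -> 22M case reported above works out to roughly 42% saved.
print(percent_saved(38, 22))  # 42.1
```

Calling `compact_wordlist("old", "new")` and passing the two sizes to `percent_saved` would give the same kind of figure for a real wordlist; whether the saving is large will presumably depend on how fragmented the old database had become.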