[PATCH] combined wordlist a.k.a. single list
Greg Louis
glouis at dynamicro.on.ca
Mon Jun 2 20:22:29 CEST 2003
On 20030602 (Mon) at 1046:22 -0700, Malcolm Dew-Jones wrote:
>
>
>
> On Mon, 2 Jun 2003, Greg Louis wrote:
>
> > On 20030601 (Sun) at 2131:55 -0700, Malcolm Dew-Jones wrote:
> > >
> > >
> > > My question/observation is this.
> > >
> > >
> > > The size of the files in the new versions is my biggest concern.
> >
> > Why? You need less disk space with one wordlist than with two.
>
> Maybe, maybe not.
>
> A simple example
>
> list 1
> word-1 counta
> word-2 counta
>
> list 2
> word-3 countb
> word-4 countb
>
> combined list
> word-1 counta countb
> word-2 counta countb
> word-3 counta countb
> word-4 counta countb
>
> My trivial example looks like it would be larger when combined. It
> depends on the overlap of words in the lists. Because my real lists are
> not the same size I know that at least one list must have a bunch of words
> that aren't in both files.

You haven't considered the indexing overhead involved in keeping two
separate databases, but I understand the point.
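
To put rough numbers on that trade-off, here is a minimal sketch of the toy model in the example above. The function name combined_vs_separate and the byte sizes are made up for illustration; they are not bogofilter's real record layout, and the indexing overhead mentioned above is ignored entirely.

```shell
#!/bin/sh
# Toy model only: every record costs key_bytes for the word plus
# count_bytes for each count stored with it. Index overhead is ignored.
# Arguments: words in list 1, words in list 2, words common to both,
#            assumed bytes per key, assumed bytes per count.
combined_vs_separate() {
    n1=$1 n2=$2 overlap=$3 key_bytes=$4 count_bytes=$5
    # Two lists: every word is stored once per list, with one count.
    separate=$(( (n1 + n2) * (key_bytes + count_bytes) ))
    # One list: every distinct word is stored once, with two counts.
    union=$(( n1 + n2 - overlap ))
    combined=$(( union * (key_bytes + 2 * count_bytes) ))
    echo "separate=$separate combined=$combined"
}

# The four-word example above: no word appears in both lists.
combined_vs_separate 2 2 0 8 4
# The other extreme: both lists hold the same two words.
combined_vs_separate 2 2 2 8 4
```

Under these assumed sizes the combined list is larger when nothing overlaps and smaller when everything does, so in this model the answer depends on the overlap and on how big a count is relative to a key.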
> However, other feedback including your own below suggests this is not a
> problem in practice, so fine.
>
>
> > either; disk space is cheap these days, and the overhead of decoding
> > might offset any lookup time saved by having slightly smaller records.
>
> I am not worried about speed, only space. With bogofilter.11.2, our
> goodlist.db file is more than 600 Megs in size and has about 12 million
> entries. The files just become less practical to manipulate as they get
> larger.

Hm. Your goodlist packs 21 times the tokens I've got into 55 times the
space. It might take a while, but I would like to suggest that you try

    for l in spam good; do
        bogoutil -d "${l}list.db" | bogoutil -l "${l}list.new"
        db_verify "${l}list.new"
    done

and see whether the .new files are a whole lot smaller than the
originals. They might well be just over half the size, and work as
well as the big ones. That's been my experience: loading a Berkeley db
in b-tree mode with pre-sorted records saves almost half the space
needed to load it with records at random.
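
If it helps, here is a small sketch for comparing the sizes once the reload has finished. percent_of is a hypothetical helper, not part of bogofilter or Berkeley DB; it only uses wc and shell arithmetic, and it assumes the .db and .new files sit in the current directory.

```shell
#!/bin/sh
# Hypothetical helper: print the size of file $2 as a whole-number
# percentage of the size of file $1 (integer division, rounds down).
percent_of() {
    old_bytes=$(wc -c < "$1")
    new_bytes=$(wc -c < "$2")
    # Guard against dividing by zero when the original file is empty.
    if [ "$old_bytes" -eq 0 ]; then
        return 1
    fi
    echo $(( $new_bytes * 100 / $old_bytes ))
}

# Report on each wordlist, skipping any pair that does not exist yet.
for l in spam good; do
    if [ -f "${l}list.db" ] && [ -f "${l}list.new" ]; then
        echo "${l}list.new is $(percent_of "${l}list.db" "${l}list.new")% of ${l}list.db"
    fi
done
```

A figure somewhere in the 50s would match what I described above; a figure near 100 would suggest the original was already loaded in sorted order.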
--
| G r e g L o u i s | gpg public key: finger |
| http://www.bgl.nu/~glouis | glouis at consultronics.com |
| http://wecanstopspam.org in signatures fights junk email |