[PATCH] combined wordlist a.k.a. single list

Greg Louis glouis at dynamicro.on.ca
Mon Jun 2 13:22:24 CEST 2003


On 20030601 (Sun) at 2131:55 -0700, Malcolm Dew-Jones wrote:
> 
> 
> My question/observation is this. 
> 
> 
> The size of the files in the new versions is my biggest concern. 

Why?  You need less disk space with one wordlist than with two.  The
classification runs faster with one wordlist than with two (if your db
cache size is right).  What, then, is the problem?
> 
> Will the combining of the files make them smaller? 

When I did mine, I had about 980,000 tokens total in spamlist.db plus
goodlist.db, and I ended up with about 940,000 in wordlist.db (there
were only 40,000 words common to spam and nonspam, in other words).
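
To make the arithmetic concrete, here is a minimal sketch of such a merge.
It is an illustration only: it uses plain Python dicts and made-up tokens,
not bogofilter's actual database format, and merge_wordlists is a
hypothetical helper name.  Every token shared by both lists collapses into
one record, so the combined list is smaller than the sum of the two by
exactly the number of shared tokens.

```python
def merge_wordlists(spam, good):
    """Merge two token->count maps into one token->(spam_count, good_count) map.

    Hypothetical helper for illustration; not bogofilter's real storage code.
    """
    combined = {}
    for token, count in spam.items():
        combined[token] = (count, 0)
    for token, count in good.items():
        # Keep any spam count already recorded for this token.
        spam_count = combined.get(token, (0, 0))[0]
        combined[token] = (spam_count, count)
    return combined


# Toy data, invented for the example.
spam = {"viagra": 12, "free": 30, "meeting": 1}
good = {"meeting": 25, "free": 4, "patch": 9}

combined = merge_wordlists(spam, good)

# Tokens present in both lists are stored once, so:
#   len(combined) == len(spam) + len(good) - shared
shared = len(spam) + len(good) - len(combined)
print(len(combined), shared)
```

With the figures above, the same relation gives 980,000 - 40,000 = 940,000
records in the combined list.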

> If the spam/ham files share a lot of words then the combined file should
> be smaller, but for us, one of the lists (I think the ham words, but can't
> check right now) is much larger, so in this case a combined file might be
> much bigger than a single file. 

The combined file was much bigger (30 MB) than either single file (21
MB and 11 MB), but somewhat smaller than the sum of the two.  So I
saved around 2 MB out of 32.

> The encoding of the number could make a difference.  If the database used
> a flag to indicate the meaning of the number then this might not be an
> issue.  For example, if the first bit in the first byte of the first
> number were used to indicate whether it was a ham or spam count, and the
> length of the data after the word were used to implicitly indicate the
> number of counts, then singleton words would only take up the same space
> as they do now in a single file (the maximum word count would be
> reduced by a factor of two, but that probably makes no practical
> difference). 

I'd be surprised if that encoding made much practical difference
either; disk space is cheap these days, and the overhead of decoding
might offset any lookup time saved by having slightly smaller records. 
If you'd like to code it and compare, though, it might prove me wrong
and would in any case be interesting.
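
For anyone who wants to try it, a rough sketch of that encoding might look
like the following.  It rests on assumptions that are not part of
bogofilter: counts are taken to be 32-bit big-endian unsigned integers, and
the names FLAG_SPAM, MAX_COUNT, encode and decode are invented for the
example.  The top bit of the first number marks a spam count, and the record
length (4 or 8 bytes) says whether one or two counts are stored.

```python
import struct

# Assumed layout: 32-bit big-endian counts; top bit of the first number is the flag.
FLAG_SPAM = 0x80000000
MAX_COUNT = 0x7FFFFFFF  # one bit is spent on the flag, halving the maximum count


def encode(spam_count, good_count):
    """Pack counts into 4 bytes if only one list has the word, else 8 bytes."""
    for count in (spam_count, good_count):
        if not 0 <= count <= MAX_COUNT:
            raise ValueError("count out of range")
    if spam_count and good_count:
        return struct.pack(">II", spam_count | FLAG_SPAM, good_count)
    if spam_count:
        return struct.pack(">I", spam_count | FLAG_SPAM)
    return struct.pack(">I", good_count)


def decode(data):
    """Return (spam_count, good_count) from a 4- or 8-byte record."""
    if len(data) == 8:
        first, second = struct.unpack(">II", data)
        return first & MAX_COUNT, second
    (value,) = struct.unpack(">I", data)
    if value & FLAG_SPAM:
        return value & MAX_COUNT, 0
    return 0, value


print(len(encode(5, 0)), len(encode(0, 7)), len(encode(3, 4)))
```

A word seen in only one list costs 4 bytes of count data here, the same as
in a separate single-list file under these assumptions; whether the extra
masking in decode() costs more than the smaller records save is the thing
that would need measuring.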

-- 
| G r e g  L o u i s          | gpg public key: finger     |
|   http://www.bgl.nu/~glouis |   glouis at consultronics.com |
| http://wecanstopspam.org in signatures fights junk email |



