[PATCH] combined wordlist a.k.a. single list

David Relson relson at osagesoftware.com
Mon Jun 2 14:15:16 CEST 2003


At 12:31 AM 6/2/03, Malcolm Dew-Jones wrote:

>My question/observation is this.
>
>The size of the files in the new versions is my biggest concern.
>
>Will the combining of the files make them smaller?

Hello Malcolm,

The short answer is "Not significantly".

Each token entry in the current spamlist.db and goodlist.db has two or 
three parts - the token's text, a 4 byte count, and (optionally/normally) a 
4 byte timestamp.  The biggest part of the entry is, of course, the token's 
text.

In the combined wordlist, there is still one entry per token, but it 
contains two 4 byte counts - one for the spam count and one for the 
non-spam count.  So for tokens that were in both lists there's a space 
savings, but for tokens that were only in one list, extra space is needed - 
specifically the 4 bytes for the second count.
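The per-token arithmetic can be sketched in a few lines of shell. This is only a rough illustration: it counts payload bytes (token text, 4 byte counts, 4 byte timestamp) and ignores whatever overhead the database itself adds. The function names and the sample token are made up for the example.

```shell
#!/bin/sh
# Payload bytes for one token, ignoring all database overhead.
# Assumes 4 byte counts and a 4 byte timestamp, as described above.

# separate_bytes TOKEN NLISTS: cost when TOKEN is stored in NLISTS separate lists
separate_bytes() {
    len=${#1}
    echo $(( $2 * (len + 4 + 4) ))
}

# combined_bytes TOKEN: cost of one combined entry (two counts, one timestamp)
combined_bytes() {
    len=${#1}
    echo $(( len + 4 + 4 + 4 ))
}

separate_bytes viagra 2   # token in both old lists: prints 28
combined_bytes viagra     # same token in the combined list: prints 18
separate_bytes viagra 1   # token in only one old list: prints 14
```

So a 6 character token that was in both lists drops from 28 to 18 bytes, while one that was in a single list grows from 14 to 18.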

Here are some numbers from my mail server:

             count    size
spamlist - 197,431   6.9M
goodlist - 395,142    13M
wordlist - 551,672    22M

As you can see, the combined wordlist is about the same size as the two 
separate lists together (22M versus roughly 20M), while the word count is 
somewhat lower (551,672 versus 592,573).
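As a rough cross-check, the three counts above give an estimate of how many tokens were in both lists. The sketch below assumes the combined wordlist holds exactly the union of the two old lists, which may not hold precisely; the function name is made up for the example.

```shell
#!/bin/sh
# overlap SPAM GOOD WORD: tokens present in both lists,
# assuming WORD is exactly the union of the two lists.
overlap() {
    echo $(( $1 + $2 - $3 ))
}

both=$(overlap 197431 395142 551672)
echo "tokens in both lists:     $both"                  # 40901
echo "tokens in one list only:  $(( 551672 - both ))"   # 510771
```

Under that assumption only about 41,000 tokens get the savings, while about 511,000 need the extra 4 bytes, which fits with the total size not shrinking.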

Below is an improved script for merging wordlists.  The last "bogoutil -d | 
bogoutil -l" step is an optimization that significantly reduces the final 
database size (from 32M to 22M for me).  At the end of the run, it shows 
word counts and database sizes.
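To see what the two awk steps in the script below do without touching a real database, here is a small sketch that runs them on made-up input. The sample lines are hypothetical and assume the dump has the form "token count date"; bogoutil itself is not called.

```shell
#!/bin/sh
# Hypothetical sample dumps in "token count date" form; not real data.
good_dump='hello 12 20030601
meeting 7 20030530'
spam_dump='hello 3 20030602
viagra 40 20030601'

# Same awk rewrites as the merge script: a zero fills the count
# column that belongs to the other list.
to_combined() {
    { printf '%s\n' "$good_dump" | awk '{ printf "%s  0 %d %d\n", $1, $2, $3 }'
      printf '%s\n' "$spam_dump" | awk '{ printf "%s %d  0 %d\n", $1, $2, $3 }'
    } | LC_ALL=C sort
}

to_combined
```

The token "hello" comes out as two adjacent lines, one from each list. The script appears to rely on the "bogoutil -l" step to fold such lines into a single entry.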

Hope this helps.

David


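#!/bin/sh
#
#  merge - combine goodlist.db and spamlist.db into a single wordlist.db

BOGODIR="/path/to/wordlists"
BOGOUTIL="/path/to/bogoutil"

# Stop if the directory is missing, so rm never runs somewhere else.
cd "$BOGODIR" || exit 1
rm -f wordlist.db wordlist.tmp

# Dump both lists as "token spamcount goodcount date", filling the
# missing count with 0, then sort and load into a temporary database.
( "$BOGOUTIL" -d goodlist.db | awk '{ printf "%s  0 %d %d\n", $1, $2, $3}' | tee good ; \
  "$BOGOUTIL" -d spamlist.db | awk '{ printf "%s %d  0 %d\n", $1, $2, $3}' | tee spam ) \
| sort | tee word | "$BOGOUTIL" -l wordlist.tmp

# Dump and reload to get a smaller final database.
"$BOGOUTIL" -d wordlist.tmp | tee temp | "$BOGOUTIL" -l wordlist.db

# Report database sizes and word counts.
ls -lh ????list.db ????list.tmp
wc -l  good spam word temp

rm -f good spam word temp wordlist.tmp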








