[PATCH] combined wordlist a.k.a. single list
David Relson
relson at osagesoftware.com
Mon Jun 2 14:15:16 CEST 2003
At 12:31 AM 6/2/03, Malcolm Dew-Jones wrote:
>My question/observation is this.
>
>The size of the files in the new versions are what is my biggest concern.
>
>Will the combining of the files make them smaller?
Hello Malcolm,
The short answer is "Not significantly".
Each token entry in the current spamlist.db and goodlist.db has two or
three parts - the token's text, a 4 byte count, and (optionally/normally) a
4 byte timestamp. The biggest part of the entry is, of course, the token's
text.
In the combined wordlist, there is still one entry per token, but it
contains two 4 byte counts - one for the spam count and one for the
non-spam count. So for tokens that were in both lists there's a space
savings, but for tokens that were only in one list, extra space is needed -
specifically the 4 bytes for the second count.
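To make the trade-off concrete, here is a rough sketch of the per-token arithmetic. It counts only the payload described above (token text, 4-byte counts, 4-byte timestamp) and ignores whatever overhead the database adds per entry. The function names and the 8-byte token length are made up for illustration.

```shell
#!/bin/sh
# Rough payload size of one token, ignoring database overhead
# (page fill, alignment, etc.), which is not quantified here.
# Assumes 4-byte counts and a 4-byte timestamp, as described above.

# separate_bytes LEN NLISTS: token stored in NLISTS (1 or 2) of the old lists
separate_bytes() {
    echo $(( $2 * ($1 + 4 + 4) ))
}

# combined_bytes LEN: one entry holding spam count, non-spam count, timestamp
combined_bytes() {
    echo $(( $1 + 4 + 4 + 4 ))
}

len=8   # hypothetical token length in bytes
echo "in both lists: $(separate_bytes $len 2) -> $(combined_bytes $len)"
echo "in one list:   $(separate_bytes $len 1) -> $(combined_bytes $len)"
```

Under these assumptions an 8-byte token present in both lists shrinks from 32 to 20 bytes, while a token present in only one list grows from 16 to 20 bytes.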
Here are some numbers from my mail server:

            count     size
spamlist  197,431     6.9M
goodlist  395,142      13M
wordlist  551,672      22M
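As you can see, the combined wordlist is about the same size as the two
separate lists together (22M versus roughly 20M), while the word count is
somewhat lower (551,672 versus 592,573), since a token that was in both
lists is now stored only once.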
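The counts also give an estimate of how many tokens the two lists share. This is a back-of-the-envelope sketch: it assumes each count is a number of distinct tokens and that the combined wordlist is exactly the union of the two lists. The variable names are mine.

```shell
#!/bin/sh
# Counts taken from the table above.
spam=197431
good=395142
word=551672

# Tokens in both lists are counted twice in spam+good but once in word.
both=$(( spam + good - word ))
# Everything else was in exactly one list and now carries a second count.
one=$(( word - both ))
# Extra payload for those single-list tokens: one more 4-byte count each.
extra=$(( one * 4 ))

echo "tokens in both lists:    $both"
echo "tokens in only one list: $one"
echo "extra bytes for counts:  $extra"
```

With these numbers that comes to 40,901 shared tokens and 510,771 single-list tokens, so the second count adds roughly 2M of payload before any savings from the shared tokens.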
Below is an improved script for merging wordlists. The last "bogoutil -d |
bogoutil -l" step is an optimization that significantly reduces the final
database size (from 32M to 22M for me). At the end of the run, it shows
word counts and database sizes.
Hope this helps.
David
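#!/bin/sh
#
# merge - combine goodlist.db and spamlist.db into a single wordlist.db

BOGODIR="/path/to/wordlists"
BOGOUTIL="/path/to/bogoutil"

# Stop here if the directory is wrong, so rm never runs elsewhere.
cd "$BOGODIR" || exit 1
rm -f wordlist.db

# Dump both lists as "token spamcount goodcount date", sort, and load.
( "$BOGOUTIL" -d goodlist.db | awk '{ printf "%s 0 %d %d\n", $1, $2, $3 }' | tee good
  "$BOGOUTIL" -d spamlist.db | awk '{ printf "%s %d 0 %d\n", $1, $2, $3 }' | tee spam ) \
    | sort | tee word | "$BOGOUTIL" -l wordlist.tmp

# Dump and reload once more to shrink the final database.
"$BOGOUTIL" -d wordlist.tmp | tee temp | "$BOGOUTIL" -l wordlist.db

# Show database sizes and word counts, then clean up.
ls -lh ????list.db ????list.tmp
wc -l good spam word temp
rm -f good spam word temp wordlist.tmp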
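For anyone who wants to see what the two awk steps produce before running the script on real data, here is a small sketch that needs no bogoutil. The two input lines are made-up examples in the "token count date" layout that the awk commands expect, and the helper function names are hypothetical.

```shell
#!/bin/sh
# Made-up dump lines in the "token count date" layout.
good_line="hello 5 20030601"
spam_line="hello 2 20030602"

# Same awk programs as in the merge script, wrapped in helper functions.
to_combined_good() { awk '{ printf "%s 0 %d %d\n", $1, $2, $3 }'; }
to_combined_spam() { awk '{ printf "%s %d 0 %d\n", $1, $2, $3 }'; }

good_out=$(echo "$good_line" | to_combined_good)
spam_out=$(echo "$spam_line" | to_combined_spam)

echo "$good_out"   # hello 0 5 20030601
echo "$spam_out"   # hello 2 0 20030602
```

Each line comes out with a zero in the count column belonging to the other list, so the two dumps share one layout and can be sorted and loaded together.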