[PATCH] combined wordlist a.k.a. single list

Malcolm Dew-Jones yf110 at victoria.tc.ca
Mon Jun 2 06:31:55 CEST 2003



My question/observation is this. 


The size of the files in the new versions is my biggest concern. 

Will combining the files make them smaller? 


If the spam/ham files share a lot of words then the combined file should
be smaller, but for us, one of the lists (I think the ham words, but can't
check right now) is much larger, so in this case a combined file might be
much bigger than a single file. 
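
As a rough way to reason about this, here is a minimal sketch that compares
the two layouts.  It assumes a made-up record layout (the word's bytes plus a
fixed 4-byte count per list) and ignores all database overhead, so it is not
how bogofilter actually stores its data; the function names are hypothetical.

```python
def separate_size(spam_words, ham_words, count_bytes=4):
    """Total bytes for two files, each record = word + one count (assumed layout)."""
    spam = sum(len(w.encode()) + count_bytes for w in spam_words)
    ham = sum(len(w.encode()) + count_bytes for w in ham_words)
    return spam + ham


def combined_size(spam_words, ham_words, count_bytes=4):
    """Bytes for one file where every word always carries two counts (assumed layout)."""
    union = set(spam_words) | set(ham_words)
    return sum(len(w.encode()) + 2 * count_bytes for w in union)


if __name__ == "__main__":
    # Toy lists: ham is larger and shares only one word with spam.
    spam = {"viagra", "free"}
    ham = {"free", "meeting", "patch", "list"}
    print("separate:", separate_size(spam, ham))
    print("combined:", combined_size(spam, ham))
```

Under these assumptions the combined file only wins when the saving from
storing each shared word once outweighs the extra count stored for every
word that appears in just one list.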


The encoding of the counts could make a difference.  If the database used
a flag to indicate the meaning of each number then this might not be an
issue.  For example, if the first bit of the first byte of the first
number indicated whether it was a ham or a spam count, and the
length of the data after the word implicitly indicated the
number of counts, then singleton words would take up only the same space
as they do now in a single file.  The maximum word count would be
reduced by a factor of two, but that probably makes no practical
difference. 
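
A minimal sketch of that idea follows.  Everything concrete in it is an
assumption made for illustration, not bogofilter's format: 32-bit big-endian
counts, flag bit set meaning "spam", and, when both counts are present, the
second number being the other list's count.

```python
import struct

FLAG_SPAM = 0x80000000  # top bit of the first number (assumed: set = spam)
COUNT_MAX = 0x7FFFFFFF  # 31 bits remain for the count itself


def encode_counts(spam, ham):
    """Pack the counts for one word into the data stored after the word."""
    if not (0 <= spam <= COUNT_MAX and 0 <= ham <= COUNT_MAX):
        raise ValueError("count does not fit in 31 bits")
    if spam and ham:
        # Two numbers: the flagged one first, the other list's count second.
        return struct.pack(">II", FLAG_SPAM | spam, ham)
    if spam:
        return struct.pack(">I", FLAG_SPAM | spam)
    return struct.pack(">I", ham)  # flag clear: a ham-only count


def decode_counts(value):
    """Return (spam, ham); the data length tells how many counts are present."""
    if len(value) == 4:
        (n,) = struct.unpack(">I", value)
        return (n & COUNT_MAX, 0) if n & FLAG_SPAM else (0, n)
    if len(value) == 8:
        first, second = struct.unpack(">II", value)
        if first & FLAG_SPAM:
            return (first & COUNT_MAX, second & COUNT_MAX)
        return (second & COUNT_MAX, first & COUNT_MAX)
    raise ValueError("unexpected data length")


if __name__ == "__main__":
    for counts in [(0, 5), (7, 0), (3, 9)]:
        data = encode_counts(*counts)
        print(counts, len(data), decode_counts(data))
```

With this layout a word seen in only one list costs a single 4-byte number,
the same as in a separate file, and only words seen in both lists pay for a
second number.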

$0.02





More information about the bogofilter-dev mailing list