hash table to replace Judy

Doug Beardsley dgbeards at southern.edu
Wed Sep 18 05:58:39 CEST 2002


Hmmm, before bogofilter came out, I wrote my own implementation of Paul
Graham's idea, including my own word dictionary/hash table/search tree.
I was originally frustrated that bogofilter required the Judy library
(and more frustrated when the dependence on BerkeleyDB came in), but I
eventually figured it was better to just use Judy and forget about
maintaining a home-grown implementation.  Since we are talking about
using new code for that job, I thought I would throw in my own attempt.
I won't include it in this message, but if you would like to see it, I
will be happy to send it to you or post it to this list.  What do others
think: stick with Judy, or find something new and eliminate the
dependency?
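
For context, what I mean by "my own word dictionary" is nothing exotic.
This is only a rough sketch of the general approach (not my actual code,
which I'll send on request): a chained hash table keyed on the word,
with a count stored in each node.

#include <stdlib.h>
#include <string.h>

#define NBUCKETS 65536          /* power of two; illustrative size only */

typedef struct node {
    struct node *next;
    unsigned     count;
    char         word[1];       /* word text stored inline after the struct */
} node_t;

static node_t *buckets[NBUCKETS];

/* djb2-style string hash; any decent string hash will do here */
static unsigned hash(const char *s)
{
    unsigned h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h & (NBUCKETS - 1);
}

/* find the node for a word, creating it with count 0 if absent */
static node_t *lookup(const char *word)
{
    unsigned h = hash(word);
    node_t *n;

    for (n = buckets[h]; n != NULL; n = n->next)
        if (strcmp(n->word, word) == 0)
            return n;

    n = malloc(sizeof(node_t) + strlen(word));
    if (n == NULL)
        return NULL;
    strcpy(n->word, word);
    n->count = 0;
    n->next = buckets[h];
    buckets[h] = n;
    return n;
}

The caller just tokenizes each message and does lookup(token)->count++
for every word; walking the buckets afterwards gives the frequency table.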

Gyepi: I would be interested in trying my implementation with your
benchmarking suite.  Let me know how to arrange this.
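
So we're talking about the same thing, here is the rough shape I have in
mind for timing one implementation against one file.  The names
(count_words, bench) are placeholders, clock() is just one way to
measure, and I'm only guessing at how your suite is actually structured:

#include <stdio.h>
#include <time.h>

/* count_words() stands in for whatever implementation is under test:
 * read the file, count word frequencies, walk the results without
 * actually printing them. */
extern void count_words(const char *filename);

/* time one program over one file for N iterations and print one line
 * roughly in the format of the results quoted below */
static void bench(const char *program, const char *filename, int iterations)
{
    double min = 1e9, max = 0.0, total = 0.0, elapsed;
    clock_t start;
    int i;

    for (i = 0; i < iterations; i++) {
        start = clock();
        count_words(filename);
        elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;

        if (elapsed < min) min = elapsed;
        if (elapsed > max) max = elapsed;
        total += elapsed;
    }

    printf("%s %s %d\tmin:%.2g max:%.2g avg:%.4g\n",
           filename, program, iterations, min, max, total / iterations);
}

Substitute whatever timing mechanism your suite actually uses; the point
is just the per-file, per-iteration min/max/avg.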

Doug Beardsley

On Tue, Sep 17, 2002 at 10:55:34PM -0400, Gyepi SAM wrote:
> On Tue, Sep 17, 2002 at 09:50:47PM -0400, David Relson wrote:
> 
> > Sounds interesting.  What were the sizes of your dictionaries and messages 
> > and what sort of performance figure did you see?
> 
> I used 5 files of various properties. Here's the output from wc
> 
>       0    1925    9840 text-09.txt
>    5000    5000   25000 text-10.txt
>    5000    5000   48893 text-20.txt
>    1427   13382   80538 text-40.txt
>    4490   44278  249434 text-50.txt
> 
> text-09.txt contains chapters 21-23 of Revelation. It is a single long
> line, due to preprocessing that removed the HTML.
> text-10.txt contains the same word repeated 5000 times.
> text-20.txt contains 5000 unique words.
> text-40.txt is the Project Gutenberg text of 'The Legend of Sleepy Hollow' by Washington Irving.
> text-50.txt is the Project Gutenberg text of 'Idle Thoughts of an Idle Fellow' by Jerome K. Jerome.
> 
> I timed how long each application took to process each file, over
> 50 iterations, and post-processed the results to compute the average.
> 
> Processing consists of reading the input, counting word frequencies, then
> looping over the results as if printing them to stdout. I disabled the
> actual printing, since output was not what I wanted to measure.
> 
> As the output shows, the hash implementation is, on average, faster than Judy.
> I'd be happy to make the benchmarking suite available if desired, or to modify the
> tests if these are not appropriate. The fields are: filename, program, iterations, stats.
> 
> text-09.txt hash 50     min:0.01 max:0.02 avg:0.0102
> text-09.txt judy 50     min:0.01 max:0.02 avg:0.011
> text-10.txt hash 50     min:0.01 max:0.02 avg:0.014
> text-10.txt judy 50     min:0.01 max:0.02 avg:0.0122
> text-20.txt hash 50     min:0.02 max:0.03 avg:0.0212
> text-20.txt judy 50     min:0.03 max:0.04 avg:0.034
> text-40.txt hash 50     min:0.03 max:0.04 avg:0.0362
> text-40.txt judy 50     min:0.06 max:0.06 avg:0.06
> text-50.txt hash 50     min:0.1 max:0.11 avg:0.1024
> text-50.txt judy 50     min:0.17 max:0.18 avg:0.1718
> 
> -Gyepi
> 
> -- 
> learning without thinking is labor lost; thinking without learning is dangerous -- Chinese Proverb


