bogofilter speed [was: garbage removal]
David Relson
relson at osagesoftware.com
Fri May 9 19:11:47 CEST 2003
Marek,
Nice job of testing and graphing. I might have labeled the horizontal axis
in KB/MB, e.g. 500K, 5.9M, 11.7M, 17.5M, 23.4M (or whatever).
There's another factor you might want to test, and that's BerkeleyDB's cache
size. We've found that a larger cache can help performance, so it would be
interesting to increase the cache size and see how it affects your results.
Code like the following would do it:
CFG="test.cfg"
for size in 0 2 4 8 16 ; do
    # write a one-line config file selecting this cache size
    cat <<Eof >"$CFG"
db_cachesize=$size
Eof
    # run the same test with that config
    bogofilter -c "$CFG" ...
done
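
If you also want the elapsed time for each cache size, something along these lines might work. It's only a sketch: the function name time_with_cache and the output format are my own invention, not anything bogofilter provides, and the one-second resolution of "date +%s" is coarse, so it only makes sense for runs that take a while.

```shell
#!/bin/sh
# Sketch: time one command per cache size.
# CFG and time_with_cache are illustrative names, not part of bogofilter.
CFG="${CFG:-test.cfg}"

time_with_cache() {
    size=$1; shift
    # one-line config file selecting this cache size
    printf 'db_cachesize=%s\n' "$size" >"$CFG"
    start=$(date +%s)
    status=0
    # run the remaining arguments as the command under test
    "$@" >/dev/null || status=$?
    end=$(date +%s)
    echo "cachesize=$size elapsed=$((end - start))s status=$status"
}
```

You'd call it once per size from the loop above, passing your usual bogofilter command line (with -c "$CFG") as the remaining arguments.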
Greg Louis and I have been doing some testing to see what happens when the
two wordlists (spamlist.db and goodlist.db) are combined into one wordlist,
i.e. wordlist.db. BerkeleyDB's default cache (256k) works poorly with a
combined wordlist. With a sufficiently large cache, the combined wordlist
outperforms the separate wordlists. For example, on my PIII-500, scoring
17,083 messages in a 90MB mbox file takes about 168 seconds with separate
wordlists. The combined wordlist takes 280 seconds with the default cache
size, but that drops to 150 seconds with a 6MB cache, and to 144 seconds
with a 10MB cache.
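
To put those numbers in relative terms, here is a small sketch that compares each combined-wordlist time against the 168-second separate-wordlist baseline. The timings are the ones quoted above; the compare function and its output format are just for illustration.

```shell
#!/bin/sh
# Timings in seconds, copied from the figures above.
baseline=168   # separate wordlists

# compare LABEL SECONDS: print the time and its change relative to baseline
compare() {
    awk -v label="$1" -v t="$2" -v base="$baseline" 'BEGIN {
        printf "%s: %d s (%+.1f%% vs. separate)\n", label, t, (t - base) / base * 100
    }'
}

compare "combined, default cache" 280
compare "combined, 6MB cache" 150
compare "combined, 10MB cache" 144
```

That works out to roughly 67% slower with the default cache, and about 11% and 14% faster with the 6MB and 10MB caches.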
Like I said, it'd be interesting if you'd test how cache size affects your
performance.
David