bogotune - C vs. perl

David Relson relson at osagesoftware.com
Tue Dec 23 00:20:51 CET 2003


On Mon, 22 Dec 2003 10:46:38 -0800
Jef Poskanzer <jef at acme.com> wrote:

> Any idea why the C version uses more memory?  Maybe that could be
> fixed.

Yes, for improved performance.  Significant work has already been done
on memory usage, and it's well nigh minimal.  Actual usage depends on
the mix of input files.  There's a lot of info in bogotune-faq.html --
read it!  I'll describe some of what happens:

During tuning, each message is scored approximately 400 to 500 times.
For efficiency, bogotune stores the messages in RAM.  While scoring, all
that's needed for each message is the ham and spam counts for its
tokens.  Since the actual tokens (think text strings) are not needed,
bogotune stores each message as an array of uint32 values (one uint32
for each token's ham count and one for its spam count).  When scoring a
large number of messages (with a large number of tokens), the memory
usage mounts up.  Considering what's involved, bogotune uses a minimal
amount of memory during scoring.
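
The storage described above can be sketched roughly as follows.  This is
a minimal illustration, not bogotune's actual code: the names
token_counts, packed_msg and packed_msg_score are hypothetical, and the
scoring function is only a stand-in (a mean per-token spam ratio) to
show that the counts alone are enough to score a message repeatedly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical layout for illustration; not bogotune's actual structs. */
typedef struct {
    uint32_t ham;   /* times this token was seen in ham  */
    uint32_t spam;  /* times this token was seen in spam */
} token_counts;

typedef struct {
    size_t n_tokens;
    token_counts *counts;  /* one pair per token; token text is not kept */
} packed_msg;

/* Bytes needed to hold the counts of one message. */
size_t packed_msg_bytes(size_t n_tokens)
{
    return n_tokens * sizeof(token_counts);
}

/* Allocate a zeroed message with room for n_tokens pairs; NULL on failure. */
packed_msg *packed_msg_new(size_t n_tokens)
{
    packed_msg *m = malloc(sizeof *m);
    if (m == NULL)
        return NULL;
    m->counts = calloc(n_tokens ? n_tokens : 1, sizeof *m->counts);
    if (m->counts == NULL) {
        free(m);
        return NULL;
    }
    m->n_tokens = n_tokens;
    return m;
}

void packed_msg_free(packed_msg *m)
{
    if (m != NULL) {
        free(m->counts);
        free(m);
    }
}

/* Stand-in score (an assumption, not bogofilter's algorithm): the mean
 * per-token spam ratio, given the total ham and spam message counts. */
double packed_msg_score(const packed_msg *m, double n_ham, double n_spam)
{
    double sum = 0.0;
    size_t used = 0;

    assert(m != NULL && n_ham > 0.0 && n_spam > 0.0);
    for (size_t i = 0; i < m->n_tokens; i++) {
        double h = m->counts[i].ham / n_ham;
        double s = m->counts[i].spam / n_spam;
        if (h + s > 0.0) {      /* skip tokens never seen in training */
            sum += s / (h + s);
            used++;
        }
    }
    return used ? sum / (double)used : 0.5;
}
```

At 8 bytes per token, a hypothetical corpus of 10,000 messages averaging
500 tokens each would need roughly 40 MB for the counts alone, which is
how the usage mounts up.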

Memory usage is highest during the input stage.  If the input messages
are all in message count format, they're converted to arrays of uint32s
as they're read.  If the inputs are normal text, they need to be parsed
and the ham and spam counts of their tokens need to be looked up.  To
speed this process, bogotune caches the wordlist in memory, and this is
where memory usage peaks.  If the need for memory causes swapping,
performance goes way down.
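
To make the idea of an in-memory wordlist cache concrete, here is a
small sketch of a token-to-counts table.  It is an assumption-laden
illustration, not bogotune's implementation: the fixed capacity
CACHE_SLOTS, the FNV-1a hash, linear probing, and all function names are
choices made for this example only.  A real cache would be filled from
the on-disk wordlist and would need to grow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define CACHE_SLOTS 1024u  /* hypothetical fixed capacity, power of two */

typedef struct {
    char *token;    /* NULL marks an empty slot */
    uint32_t ham;
    uint32_t spam;
} cache_entry;

typedef struct {
    cache_entry slots[CACHE_SLOTS];
    size_t used;
} wordlist_cache;

/* FNV-1a string hash. */
static uint32_t hash_token(const char *s)
{
    uint32_t h = 2166136261u;
    while (*s != '\0') {
        h ^= (unsigned char)*s++;
        h *= 16777619u;
    }
    return h;
}

void cache_init(wordlist_cache *c)
{
    assert(c != NULL);
    for (size_t i = 0; i < CACHE_SLOTS; i++)
        c->slots[i].token = NULL;
    c->used = 0;
}

/* Find the slot holding token, or the empty slot where it would go. */
static cache_entry *cache_slot(wordlist_cache *c, const char *token)
{
    size_t i = hash_token(token) & (CACHE_SLOTS - 1);
    while (c->slots[i].token != NULL && strcmp(c->slots[i].token, token) != 0)
        i = (i + 1) & (CACHE_SLOTS - 1);    /* linear probing */
    return &c->slots[i];
}

/* Insert or update counts; 0 on success, -1 if full or out of memory. */
int cache_put(wordlist_cache *c, const char *token, uint32_t ham, uint32_t spam)
{
    cache_entry *e = cache_slot(c, token);
    if (e->token == NULL) {
        size_t len = strlen(token) + 1;
        if (c->used + 1 >= CACHE_SLOTS)  /* keep one slot empty so probes stop */
            return -1;
        e->token = malloc(len);
        if (e->token == NULL)
            return -1;
        memcpy(e->token, token, len);
        c->used++;
    }
    e->ham = ham;
    e->spam = spam;
    return 0;
}

/* Look up a token without touching the disk; NULL if it is not cached. */
const cache_entry *cache_get(wordlist_cache *c, const char *token)
{
    const cache_entry *e = cache_slot(c, token);
    return e->token != NULL ? e : NULL;
}

void cache_free(wordlist_cache *c)
{
    for (size_t i = 0; i < CACHE_SLOTS; i++) {
        free(c->slots[i].token);
        c->slots[i].token = NULL;
    }
    c->used = 0;
}
```

Each lookup is then a hash and a few string comparisons in memory
rather than a database read, at the cost of holding every token string
and its counts in RAM for the duration of the input stage.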

Bogotune could be modified to let the database system do the needed
caching.  This would trade higher disk usage for lower memory usage.
However, if there's enough memory to cache the complete wordlist, that
approach would be slower than the current technique.

AFAICT, bogotune makes efficient use of the memory it uses.

Hopefully this clarifies matters for you.

David



