.db rebuilds: comparing versions, and a note on formail

Greg Louis glouis at dynamicro.on.ca
Fri Jan 31 14:02:53 CET 2003


Matthias isn't having the same db troubles as I am, according to a
recent posting, so I thought it might help if I gave some details:

Just finished rebuilding my spamlist.db with 0.10.1.4:

# time ./bogofilter -v -s -d /root/scratch </root/.bogofilter/spam_corpus 
# 5868782 words, 14502 messages

real    13m24.497s
user    0m55.840s
sys     0m17.420s

With formail and 0.10.1.1 that took just over four hours, which is one
reason I kinda like being able to register a whole mbox with one
bogofilter run.
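
To put those two figures on a per-message footing, here is a rough sketch of the arithmetic. It assumes "just over four hours" can be treated as a lower bound of exactly four hours, and it compares different versions (0.10.1.1 with formail against 0.10.1.4 in a single run), so the ratio is indicative only.

```python
# Figures quoted in the message. "Just over four hours" is taken as a
# lower bound of exactly four hours (an assumption, not a measurement).
MESSAGES = 14502
single_run_s = 13 * 60 + 24.497   # "real 13m24.497s" with 0.10.1.4
formail_run_s = 4 * 3600          # lower bound for formail + 0.10.1.1


def per_message_ms(total_seconds, messages):
    """Average wall-clock cost per message, in milliseconds."""
    return 1000.0 * total_seconds / messages


single_ms = per_message_ms(single_run_s, MESSAGES)
formail_ms = per_message_ms(formail_run_s, MESSAGES)

print(f"single run : {single_ms:7.1f} ms/message")
print(f"formail    : {formail_ms:7.1f} ms/message (at least)")
print(f"ratio      : {formail_ms / single_ms:7.1f}x")
```

On these numbers the per-message route costs close to a second per message, against a few tens of milliseconds when the whole mbox goes through one run.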

Comparing apples with apples, 0.8.0 ran the above command in about
seven minutes, and 0.10.1.1 took about 25.  In each case bogofilter
had been run previously, so the executable was buffered in memory, but
the spam_corpus file had not been opened since the last reboot (this is
being done on a notebook that gets rebooted at least once daily).  The
machine is a uniprocessor (UP) at 1.1GHz and has 512MB of RAM, but the
HD is a rather slow IDE.

Notice, however, that the first 6400 nonspams into an empty list don't
take long at all, even though the word count is about 54% of that
processed in the spam job:

# time cat /store/mail/backup/*seen* | ./bogofilter -v -n -d /root/scratch
# 3149239 words, 6366 messages

real    0m38.830s
user    0m21.900s
sys     0m2.450s

Turns out that the token counts are quite different: the spamlist has
517529 tokens and the goodlist has 94519.  Still, this shows my db
slowing down rather drastically as it gets up around 500000 tokens (the
0.8.0 spamlist, without mime processing, has 331,741 tokens).
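
For what it's worth, the two `time` outputs above can be reduced to throughput and CPU-busy figures. This is only a sketch: the helper names (`to_seconds`, `summarise`) are made up for illustration, and reading a low CPU fraction as "waiting on the disk" is an interpretation, not something measured here.

```python
import re


def to_seconds(stamp):
    """Convert a shell `time` value such as '13m24.497s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognised time value: {stamp!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


# Word counts and timings copied from the two runs quoted in the message.
runs = {
    "spam": {"words": 5868782, "real": "13m24.497s",
             "user": "0m55.840s", "sys": "0m17.420s"},
    "nonspam": {"words": 3149239, "real": "0m38.830s",
                "user": "0m21.900s", "sys": "0m2.450s"},
}


def summarise(run):
    """Return words per wall-clock second and the CPU-busy fraction."""
    real = to_seconds(run["real"])
    cpu = to_seconds(run["user"]) + to_seconds(run["sys"])
    return {"words_per_s": run["words"] / real, "cpu_fraction": cpu / real}


summary = {name: summarise(run) for name, run in runs.items()}
for name, s in summary.items():
    print(f"{name:8s} {s['words_per_s']:9.0f} words/s  "
          f"CPU busy {100 * s['cpu_fraction']:5.1f}%")
```

The spam job comes out at under a tenth of its wall-clock time spent on the CPU, while the nonspam job is CPU-busy for well over half of its run, which would fit a db that is mostly waiting on I/O once it gets large.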

Another 3530 messages for the goodlist are on another machine over
NFS, and even they take only about 2 minutes to register.

I think it's the token count difference that matters.  The final token
count in the goodlist is 174421.  I've got a quantum leap in
registration time somewhere between three-and-a-half and five hundred
thousand tokens, both with db-3.3.17 and with db-4.1.25.  It shows up in
classification too; messages of quite moderate size can take two or
more seconds each to process, while 0.8.0 gets them done in a couple
hundred milliseconds.

-- 
| G r e g  L o u i s          | gpg public key:      |
|   http://www.bgl.nu/~glouis |   finger greg at bgl.nu |
| Help free our mailboxes. Include                   |
|        http://wecanstopspam.org in your signature. |



