throughput

Greg Louis glouis at dynamicro.on.ca
Sun Jan 26 14:20:52 CET 2003


On 20030125 (Sat) at 2018:26 -0500, David Relson wrote:

>> We _can't_ have new emails take a half second to process on a 1.2GHz
>> machine.  I will definitely try to contribute in this area; been
>> quite a while since I was involved in performance tuning, but maybe
>> I can bring back some old memories...

> I just ran 20030105.mbx through "bogofilter -v -s" with an empty 
> wordlist.  2.8MB giving 42,499 words, and 114 messages.  1.97s user, 0.02s 
> system, 0:02.10 elapsed, 94%CPU (on my P-III 500mhz, 256Mb ram, SCSI 
> hd).  That's roughly 50 msgs/sec or 2ms/msg.
> 
> Sounds like something's very different where you are.
 
Processing for db registration takes me a while but not an
unconscionably long while.  I consume no more than 8% of cpu (the hard
disk on this laptop ain't the quickest, and registration is definitely
IO-bound).  It's a PIII Mobile, 1.1 (not 1.2)GHz, 512Mb RAM, IDE hard
disk (the interrupt-unmask flag is set):

# time ./bogofilter -d /root/scratch -v -s </store/spam_corpus
# 5811580 words, 14309 messages

real    13m13.070s
user    0m54.110s
sys     0m17.260s

In real time that's 55ms per message, which is acceptable though
not blindingly quick.  The average, however, is misleading: as you'd
expect, the time per message starts very short and grows, nonlinearly,
as the size of the database increases:

# time cat /store/mail/backup/*seen* | ./bogofilter -d /root/scratch -v -n
# 3148200 words, 6366 messages

real    0m31.031s
user    0m21.190s
sys     0m2.580s

The average for these 6366 messages was only 4.8ms.  Both runs were
into previously empty wordlists.

I added the rest of the nonspams in the training corpus (stored on
another machine and fetched over nfs, so the timing isn't relevant)
and now comes the acid test.  On an otherwise quiescent machine that
hasn't seen these emails before (so they're not buffered in memory), I
run the test20 mbox, containing 20 spam followed by 20 rather
spammy-looking nonspam:

# time formail -s ./bogofilter -d /root/scratch -v </root/test20
1 3.43e-10
1 2.58e-11
1 5.55e-16
1 0.00e+00
? 0.498577
? 0.917960
? 0.972464
1 6.25e-08
1 1.66e-05
1 1.22e-14
1 0.00e+00
1 0.00e+00
1 6.97e-12
1 3.10e-13
1 1.66e-04
1 1.11e-07
1 0.00e+00
1 4.73e-08
1 0.00e+00
1 1.73e-07
0 9.80e-02
? 0.172396
1 3.13e-04
? 0.500000
? 0.499992
0 6.06e-02
? 0.501437
0 1.76e-04
0 2.25e-02
? 0.683803
? 0.410621
0 6.99e-03
0 6.61e-03
? 0.560911
? 0.497964
0 1.23e-06
0 3.68e-11
? 0.656241
? 0.845151
? 0.426761

real    0m20.044s
user    0m0.410s
sys     0m0.320s

Half a second each, on the average.  Three unsure spams, one false
positive :( and 11 unsure nonspams.  Tuning is essential, but the time
is the killer.  If I do the same thing with bogofilter-0.8.0 (after
a reboot so the data aren't in memory):

# time formail -s bogofilter -v </root/test20
1 0.00e+00
1 0.00e+00
1 0.00e+00
1 7.18e-14
1 2.09e-07
1 8.69e-10
1 9.53e-06
1 7.87e-03
1 8.14e-05
1 9.72e-03
1 0.00e+00
1 0.00e+00
1 0.00e+00
1 4.01e-10
1 8.07e-03
1 7.54e-08
1 1.05e-04
1 2.22e-16
1 0.00e+00
1 1.35e-04
? 5.21e-01
? 3.09e-01
? 9.51e-01
? 1.57e-01
? 4.73e-01
0 2.88e-02
? 6.65e-01
0 1.54e-03
0 2.29e-02
? 9.69e-01
0 1.65e-04
0 6.12e-02
0 1.17e-02
? 6.22e-01
? 4.97e-01
0 1.16e-06
0 1.56e-11
? 6.21e-01
? 5.38e-01
? 4.21e-01

real    0m7.325s
user    0m0.290s
sys     0m0.290s

In both cases, just loading bogofilter for the first time is taking
about five seconds.  If we knock 5 seconds off for that, the times per
message are 376ms and 58ms respectively.

With bogofilter in memory already, but not the messages to be
processed:

# time formail -s ./bogofilter -d /root/scratch -v </store/mail/csseen
...
real    0m22.191s
user    0m1.400s
sys     0m0.640s

# grep -c 'From ' /store/mail/csseen
117

That's 190ms per message.  The difference from 376 is presumably that
more of the db is in memory buffers as well.  (Do you know of a way,
other than rebooting, to mark all buffers as needing refresh?)

Again, old bogofilter is faster:

# time formail -s bogofilter -v <vspam
...
real    0m1.273s
user    0m0.250s
sys     0m0.220s

# grep -c '^From ' vspam
37

34.4ms per message.  If you consider that formail must be adding some
overhead, the difference is even more striking.  (This isn't an
apples-to-apples comparison; what I should have done is to reboot the
machine and run /store/mail/csseen again, because these 37 spams may
have an average length or token count different from that of those 117
nonspams.  But in the light of the earlier result with test20, it's
probably not that misleading.)

It may be that there's no particular merit in old bogofilter's db
access, but that the different db sizes make a big big difference in
access time.  Bogoutil takes more than twice as long to dump the new
spamlist as it does to dump the old one (when neither has been called
into memory before), even though the new one is only about 1.5 times
bigger in wordcount and 1.7 times bigger in bytes (bogoutil has been
run with a bogus db name so the binary's in memory when we start):

# time bogoutil -d .bogofilter/spamlist.db | wc -l
 331390

real    0m16.539s
user    0m1.100s
sys     0m0.090s

# time bogoutil -d scratch/spamlist.db | wc -l
 510264

real    0m35.002s
user    0m1.570s
sys     0m0.150s

The goodlist.db files are about the same size:

# bogoutil -d .bogofilter/goodlist.db | wc -l
 143315

# bogoutil -d scratch/goodlist.db | wc -l
 142125

Either there's a db scaleability problem, or some changes (better
locking?) done since 0.8.0 have degraded throughput rather
significantly.  Now mind you, I'm on db-3.1.17; should I try upgrading
to 4, do you think?

-- 
| G r e g  L o u i s          | gpg public key:      |
|   http://www.bgl.nu/~glouis |   finger greg at bgl.nu |
| Help free our mailboxes. Include                   |
|        http://wecanstopspam.org in your signature. |




More information about the Bogofilter mailing list