switching between different databases - in 1.3.0.rc1
Rob McEwen
rob at invaluement.com
Fri May 23 15:47:54 CEST 2025
Matthias,
>What does "slower" mean? How do you measure? What's the Debian version and what are the database versions? Do you compare Debian VM with 1.3.0 vs. bogofilter 1.2.5 on a real machine without VM in between? What's the underlying filesystem for both? Configured how exactly?
So first, as I had already hinted at, this was just an initial cursory
review, and it was very far from being perfect.
But here is that setup:
Both are VMware Workstation Pro Debian virtual machines running inside a
Windows 2019 server
(1) The Bogofilter 1.2.5 instance is using Berkeley DB 5.3.28 and running
on Debian 10
(2) The Bogofilter 1.3.0.rc1 instance is using LMDB 0.9.24 and running
on Debian 12
So both are VMs running inside this same Windows 2019 server. The
Windows server has a very high-end RAID 10 array with multiple very fast
enterprise SSD drives. It's a very fast server with 128 GB of RAM. The
two VMware instances were accessing emails inside directories on the
"parent" host, so while there are things happening that could be adding
latency and overhead, it was mostly an apples-to-apples comparison,
unless one OS being Debian 12 made things slower than the other using
Debian 10? So the issue wasn't the absolute speed, it was the relative
speed DIFFERENCE between the two: the 1.2.5 instance using Berkeley DB
was just a tiny bit faster than the 1.3.0.rc1 instance using LMDB
0.9.24, when I was expecting the opposite. Also, maybe "speed" is the
wrong term? I'm referring to how long these processes took to complete
end-to-end.
But in my more formal testing that I'll do next, all of this will run on
the same VM, using 1.3.0.rc1, to minimize the differences.
But for those cursory tests, I just simply used the following command:
time bogofilter -t < /path-to-email
So I compared several emails on both systems, running that command one
at a time against a variety of emails, and ignoring the first couple of
tries on each system, so that there wouldn't be any bias from a
message not yet being in either system's cache. So as shown, I was
measuring this using the built-in "time" command on Debian, and then
just running individual scans on various individual emails.
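
A minimal sketch of how those one-at-a-time timings could be scripted,
assuming Python 3 is available on the VM. The function names, the run
count, and the warm-up count are illustrative choices and are not part
of bogofilter; only the "bogofilter -t" invocation comes from the
command above.

```python
import statistics
import subprocess
import sys
import time


def time_command(argv, stdin_path, runs=10, warmup=2):
    """Return wall-clock durations (seconds) for `argv` reading stdin_path.

    The first `warmup` executions are run but not recorded, mirroring the
    "ignore the first couple of tries" step so caches are already warm.
    """
    durations = []
    for i in range(warmup + runs):
        with open(stdin_path, "rb") as msg:
            start = time.perf_counter()
            # check=False: bogofilter reports its classification through
            # the exit status, so a nonzero status is not a failure here.
            subprocess.run(argv, stdin=msg, stdout=subprocess.DEVNULL,
                           stderr=subprocess.DEVNULL, check=False)
            elapsed = time.perf_counter() - start
        if i >= warmup:
            durations.append(elapsed)
    return durations


def summarize(durations):
    """Reduce a list of durations to a few comparable numbers."""
    return {
        "runs": len(durations),
        "min": min(durations),
        "median": statistics.median(durations),
    }


if __name__ == "__main__":
    # Usage (hypothetical script name): python3 bench.py msg1.eml msg2.eml
    for path in sys.argv[1:]:
        stats = summarize(time_command(["bogofilter", "-t"], path))
        print(path, stats)
```

Running the same script against the same messages on each instance and
comparing the medians should be less sensitive to a single slow run than
eyeballing individual "time" outputs. Note that this measures the whole
process (startup, database open, scoring), not the database lookups
alone.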
>Note to be comparing in a fair manner, Bogofilter DB should be used in "transactional mode" so as to be made robust (recoverable) against crashes because the other databases you're looking at should do just that: transactions. SQLite3 certainly does so
I didn't consider that. Good point. I'll do that. But for those in
situations where Bogofilter is ONLY doing reads, and no actual
updates/writes to the data are happening during production usage
(including the read-only scans in the cursory tests that I just
described), would this still matter? Or are transactions involved even
when it's just reading from the db?
NOTE: What's also interesting here is that my system does extremely rare
training of messages on production systems. I know that's not how most
email systems using Bogofilter do it. But I've found great success in
building the data on a separate non-production VM, then copying the
updated wordlist.db to the production server. And in those rare
instances where training on the production system is needed, it's always
a "one-writer" situation. And so (unlike most systems?) I don't really
need transactions to prevent corruption badly enough to prefer them
over performance gains. So I suspect that sticking with Berkeley with
"db_transaction=no" might actually continue to be the best setup for my
situation?
Also, "db_transaction=no" is still supported for Berkeley in the new
version, correct?
So as you finish up this new version - please keep in mind that there
are probably numerous others with similar preferences/situations - not
needing transactional safety because they're either not doing any writes
on production systems - or in other cases they have a one-writer
situation with the data sufficiently backed up.
>I don't fully understand your question.
You didn't understand my question about LMDB - because it was a dumb
question - caused by me not understanding LMDB. I had mistakenly thought
that its ability to persist data in memory involved it running as a
service/daemon, or something like that. Sorry about that!
>Should you decide to do anything of profiling/performance metrics and you identify hot spots or I/O slowdowns somewhere, please share your findings.
Will do!
>
>BTW, the plan is to fix that dash or underscore bug in the configuration/documentation which means I need to look at most of bogofilter, and then do 1.3.0.rc2 with that fix. <https://gitlab.com/bogofilter/bogofilter/-/issues/15>
Excellent! Thanks for all you do!
Rob McEwen, invaluement