switching between different databases - in 1.3.0.rc1

Fri May 23 15:47:54 CEST 2025

Matthias,

>What does "slower" mean? How do you measure? What's the Debian version and what are the database versions? Do you compare Debian VM with 1.3.0 vs. bogofilter 1.2.5 on a real machine without VM in between? What's the underlying filesystem for both? Configured how exactly?

So first, as I had already hinted at, this was just an initial cursory 
review that was very far from being perfect.

But here is that setup:

Both are VMware Workstation Pro Debian virtual machines running inside a 
Windows 2019 server

(1) The Bogofilter 1.2.5 instance is using Berkeley DB 5.3.28 and running 
on Debian 10

(2) The Bogofilter 1.3.0.rc1 instance is using LMDB 0.9.24 and running 
on Debian 12

So both are VMs running inside this same Windows 2019 server. The 
Windows server has a very high-end RAID 10 with multiple very fast 
enterprise SSD drives. It's a very fast server with 128 GB of RAM. The 
two VMware instances were accessing emails inside directories on the 
"parent" host - so while there are things happening that could be adding 
latency and overhead, it was mostly an apples-to-apples comparison, 
unless one OS being Debian 12 made things slower than the other using 
Debian 10? So the issue wasn't the speed itself, it was the relative 
speed DIFFERENCE between the two, where the 1.2.5 instance using 
Berkeley DB was just a tiny bit faster than the 1.3.0.rc1 instance using 
LMDB 0.9.24 - when I was expecting the opposite. Also, maybe "speed" is 
the wrong term? I'm referring to how long these processes took to 
complete end-to-end.

But in my more formal testing that I'll do next, all of this will run on 
the same VM, using 1.3.0.rc1, to minimize the differences.

But for those cursory tests, I just simply used the following command:

time bogofilter -t < /path-to-email

So I compared several emails on both systems, running that command one 
at a time against a variety of emails, and ignoring the first couple of 
tries on each system, so that there wouldn't be any bias based on the 
message not yet being in either system's cache. So as shown, I was 
measuring this using the built-in "time" command on Debian, and then 
just running individual scans on various individual emails.
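For the more formal round, the same idea (a few discarded warm-up runs, 
then timed runs) could be scripted roughly like this. It's only a 
sketch: the function name, the WARMUP/RUNS defaults and the BOGO_CMD 
override are all made up for illustration, and it relies on GNU date's 
%N (nanoseconds), which Debian has.

```sh
#!/bin/sh
# Sketch only: average wall-clock time of a classifier command per message,
# after discarding a few warm-up runs. All names and defaults here are
# hypothetical, not taken from bogofilter or its documentation.

WARMUP=${WARMUP:-2}                     # untimed runs, to warm the caches
RUNS=${RUNS:-5}                         # timed runs per message
BOGO_CMD=${BOGO_CMD:-"bogofilter -t"}   # command that reads a message on stdin

time_message() {
    msg=$1

    # Warm-up runs; output and exit status are ignored.
    i=0
    while [ "$i" -lt "$WARMUP" ]; do
        $BOGO_CMD < "$msg" > /dev/null 2>&1 || true
        i=$((i + 1))
    done

    # Timed runs, measured in nanoseconds (GNU date).
    start=$(date +%s%N)
    i=0
    while [ "$i" -lt "$RUNS" ]; do
        $BOGO_CMD < "$msg" > /dev/null 2>&1 || true
        i=$((i + 1))
    done
    end=$(date +%s%N)

    # Print the message path and the average microseconds per run.
    printf '%s %s\n' "$msg" "$(( (end - start) / RUNS / 1000 ))"
}
```

It would be called once per message, e.g. time_message /path-to-email, 
and the same script could then be run unchanged against each database 
backend. Note that this measures the whole process end-to-end (startup, 
opening the database, scoring), which is what I was comparing above.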

>Note to be comparing in a fair manner, Bogofilter DB should be used in "transactional mode" so as to be made robust (recoverable) against crashes because the other databases you're looking at should do just that: transactions. SQLite3 certainly does so

I didn't consider that. Good point. I'll do that. But for those in 
situations where Bogofilter is ONLY doing reads, and no actual 
updates/writes to the data are happening during production usage - 
including the read-only cursory tests that I just described - would 
this even matter? Or are transactions involved even when it's just 
reading from the db?

NOTE: What's also interesting here is that my setup only very rarely 
trains messages on production systems. I know that's not how most 
email systems using Bogofilter do it. But I've found great success in 
building the data on a separate non-production VM, then copying the 
updated wordlist.db to the production server. And in those rare 
instances where training on the production system is needed, it's always 
a "one-writer" situation. And so (unlike most systems?) I don't really 
need transactions to prevent corruption badly enough to prefer them 
over performance gains. So I suspect that sticking with Berkeley with 
"db_transaction=no" might actually continue to be the best setup for my 
situation?
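To illustrate the copy step, here is roughly how the new file could be 
swapped in once it has arrived on the production box. This is a sketch 
under assumptions, not my exact procedure: the function name and 
arguments are hypothetical, and it assumes the temporary file and the 
live wordlist.db sit in the same directory (so the same filesystem).

```sh
#!/bin/sh
# Sketch only: replace the live wordlist.db in one step.
# install_wordlist and its arguments are hypothetical names.

install_wordlist() {
    src=$1                 # freshly built wordlist.db from the training VM
    dest=$2                # live wordlist.db that bogofilter reads
    tmp="$dest.new.$$"     # temp name next to the live file

    # Copy next to the destination first, so the final rename stays on
    # one filesystem.
    cp "$src" "$tmp" || return 1

    # A same-filesystem mv is a rename: a process opening the file sees
    # either the old database or the new one, never a partial copy.
    mv -f "$tmp" "$dest"
}
```

Since each scan opens the database fresh, scans that start after the 
rename pick up the new data, and ones already running keep reading the 
old file until they exit.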

Also, "db_transaction=no" is still supported for Berkeley in the new 
version, correct?

So as you finish up this new version - please keep in mind that there 
are probably numerous others with similar preferences/situations - not 
needing transactional safety because they're either not doing any writes 
on production systems - or in other cases they have a one-writer 
situation with the data sufficiently backed up.

>I don't fully understand your question.

You didn't understand my question about LMDB - because it was a dumb 
question - caused by me not understanding LMDB. I had mistakenly thought 
that its ability to persist data in memory involved it running as a 
service/daemon, or something like that. Sorry about that!

>Should you decide to do anything of profiling/performance metrics and you identify hot spots or I/O slowdowns somewhere, please share your findings.

Will do!

>
>BTW, the plan is to fix that dash or underscore bug in the configuration/documentation which means I need to look at most of bogofilter, and then do 1.3.0.rc2 with that fix. <https://gitlab.com/bogofilter/bogofilter/-/issues/15>

Excellent! Thanks for all you do!

Rob McEwen, invaluement