switching between different databases - in 1.3.0.rc1
Matthias Andree
matthias.andree at gmx.de
Tue Jun 3 22:02:45 CEST 2025
On 23.05.25 at 15:47, Rob McEwen via bogofilter wrote:
> Matthias,
>
>
>> What does "slower" mean? How do you measure? What's the Debian
>> version and what are the database versions? Do you compare Debian VM
>> with 1.3.0 vs. bogofilter 1.2.5 on a real machine without VM in
>> between? What's the underlying filesystem for both? Configured how
>> exactly?
>
> So first, as I had already hinted at, this was just an initial cursory
> review that was very far from being perfect.
>
> But here is that setup:
>
> Both are VMware Workstation Pro Debian virtual machines running inside
> a Windows 2019 server
>
> (1) The Bogofilter 1.2.5 instance is using Berkeley DB 5.3.28 and
> running on Debian 10
>
> (2) The Bogofilter 1.3.0.rc1 instance is using LMDB 0.9.24 and running
> on Debian 12
>
> So both are VMs running inside this same Windows 2019 server. The
> Windows server has a very very high end RAID 10 with multiple very
> fast Enterprise SSD drives. It's a very fast server with 128 GB of
> RAM. The two VMware instances were accessing emails inside directories
> in the "parent" host - so while there are things happening that could
> be adding additional latency and overhead, it was mostly an
> apples-to-apples comparison, unless one OS being Debian 12 made
> things slower than the other using Debian 10? So the issue wasn't the
> speed, it was the relative speed DIFFERENCE between the two, where the
> 1.2.5 instance using Berkeley DB was just a tiny bit faster than the
> 1.3.0.rc1 instance using LMDB 0.9.24 - when I was expecting the
> opposite. Also, maybe "speed" is the wrong term? I'm referring to how
> long these processes
> took to complete end-to-end.
>
> But in my more formal testing that I'll do next, all of this will run
> on the same VM, using 1.3.0.rc1, to minimize the differences.
>
> But for those cursory tests, I just simply used the following command:
>
> time bogofilter -t < /path-to-email
>
> So I compared several emails on both systems, running that command
> one at a time on a variety of emails, and ignoring the first couple
> of tries on each system, so that there wouldn't be any bias from the
> message not yet being in either system's cache. So as shown, I was
> measuring this using the built-in "time" function in Debian, and then
> just running individual scans on various individual emails.
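A small aside on methodology: looping over a whole directory of sample
messages tends to give steadier numbers than eyeballing single runs.
Something along these lines would do (the paths are just placeholders,
adjust to your setup):

    cd /path/to/sample-mails
    time sh -c 'for m in *.eml; do bogofilter -t < "$m" >/dev/null; done'

That times the whole batch in one go, so per-run shell start-up and
cache effects average out across many messages.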
>
>> Note: to compare in a fair manner, Bogofilter's DB should be used
>> in "transactional mode" so that it is robust (recoverable) against
>> crashes, because the other databases you're looking at do just
>> that: transactions. SQLite3 certainly does.
>
> I didn't consider that. Good point. I'll do that. But for those in
> situations where Bogofilter is ONLY doing reads, and no actual
> updates/writes to the data are happening during production usage -
> including only reads done during the cursory tests I just described
> - wouldn't this be a non-issue? Or are transactions involved even when
> it's just reading from the db?
Yes, at least far more extensive locking is involved, because the
database needs to make sure transactions are isolated, but also that
they are complete (durable) once committed, so competing changes to the
same areas must be prevented.
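For the fair comparison, transactional mode can be switched on in
bogofilter.cf (its location varies by distribution, often
/etc/bogofilter.cf), for example:

    # make the Berkeley DB back end recoverable, comparable to LMDB/SQLite3
    db_transaction=yes

or point bogofilter at an alternate config file with -c if you don't
want to touch the system-wide one.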
>
> NOTE: What's also interesting here is that my system does extremely
> rare training of messages on production systems. I know that's not how
> most email systems using Bogofilter do it. But I've found great success
> in building the data on a separate non-production VM, then copying the
> updated wordlist.db to the production server. And in those rare
> instances where training on the production system is needed, it's
> always a "one-writer" situation. And so (unlike most systems?) I don't
> really need transactions to prevent corruption badly enough to prefer
> that over the performance gains. So I suspect that
> sticking with Berkeley with "db_transaction=no" might actually
> continue to be the best setup for my situation?
It might, but then again, Berkeley DB 5.3 is quite old, 12 years. It is
the last version under the liberal license, though. If it's read-only on
the production server, it's good enough, because it doesn't corrupt
itself while being read unless the hardware gives up the ghost.
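If you keep copying the wordlist around like that, the usual shape is
something like this (host name and target path are just examples):

    # on the training VM
    bogofilter -s < spam-message    # register a spam message
    bogofilter -n < ham-message     # register a ham message
    # push the rebuilt database to production
    rsync -a ~/.bogofilter/wordlist.db mailhost:/var/lib/bogofilter/

Copying to a temporary name on the same filesystem and renaming it into
place avoids readers ever seeing a half-written file.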
> Also, "db_transaction=no" is still supported for Berkeley in the new
> version, correct?
I don't remember removing that.
>
> So as you finish up this new version - please keep in mind that there
> are probably numerous others with similar preferences/situations - not
> needing transactional safety because they're either not doing any
> writes on production systems - or in other cases they have a
> one-writer situation with the data sufficiently backed up.
And all the other good stuff such as battery-backed write caches on your
RAID and all that... and still, if the application crashes or the kernel
panics while writing, you end up recovering from backup. Which may be
good enough for many. The thing is that with Berkeley DB, many folks who
went without transactions over the many years of bogofilter's use found
out the hard way that corruption happens silently and can cause
non-terminating loops when reading databases.
A reply to your later message will be sent separately.