Wanting a pre-db4 bogofilter

Matthias Andree matthias.andree at gmx.de
Fri Feb 25 10:50:19 CET 2005


Karl Schmidt <karl at xtronics.com> writes:

> Matthias Andree wrote:
>
>> MySQL has several, but none of this has any meaning to bogofilter,
>> which does not use MySQL.
>>
> I'm sorry, perhaps I haven't been clear enough - I'm not suggesting
> bogofilter use mysql -- that would be a "bad-thing"(TM). What I'm saying
> is mysql can use several different database storage engines. And yes,
> mysql can use Bdb just like bogofilter does, but there are reasons to
> look at the different ones, as one size does not fit all. Thus a look at
> the trade-offs of the different storage engines used in mysql could
> enlighten the choice for bogofilter including just the limited choice
> between Bdb3 and Bdb4.

There is no real choice to make between Bdb3 and Bdb4; in fact, the
changes between Berkeley DB 3 and 4 are so minor that they don't matter.
Berkeley DB 3.2, 3.3, 4.0, 4.1, 4.2 and 4.3 will all work the same for
bogofilter, although versions since 4.1 will be a bit more robust
against system crashes.

> My question still is does bogofilter need the added complexity of
> Bdb4?

No-one wants the added complexity, but we want the crash proofing. There
are recurring claims that bogofilter's performance is decreasing, and
without transactions we can never tell whether a crash or an abort
trashed the database, whether spam has changed, or whether there is
some other reason.

> Is the advantage a practical advantage or a theoretical advantage? Right
> now I have 185 1M log files in my db directory and it looks like I will
> have to set up yet another cron job to deal with it - this just seems to
> me that it isn't keeping with the KISS philosophy.

The next bogofilter release (I'm still trying to convince David we
should call it 0.93.6, but it might end up as 0.94.0) will, by default,
automatically remove most of these log files when the program closes the
database (this can be turned off with --db-log-autoremove=no).

>> exhibit A-C-I-D properties, as does Berkeley DB with transactional
>> datastore mode. Berkeley DB without transactional mode, QDBM and TDB
>> only exhibit the I (isolation) property.
>>
> Yes, and if I understand bogofilter (not that I do completely) could
> still work without any of the A-C-I-D (but probably could use the "D"
> which is provided by a Journaled file system (I think?))?

Are you actually reading my messages?

I wrote, more than once, that journalled file systems journal directory
changes, *not* file content changes!

Hence, journalled file systems guarantee nothing except making on-disk
directory and inode structures consistent. I don't know about Solaris,
but FreeBSD and Linux are *still* incapable of /enforcing/ a proper
write order for their journals or softdep filesystems with write caches
enabled, so you can also trash your journalled file system unnoticed...
And having write caches enabled is the common case, unfortunately.

At any rate, the guarantees journalled file systems make are limited to
the file system internal structures and have *zero* relevance for
bogofilter.  In the *best case*, a journalled file system can guarantee
that Berkeley DB will detect corruption and run database recovery.
In the *common case*, a journalled file system will just cause a quicker
boot with the same level of corruption as an unjournalled file system
might have caused.

A journalled file system does NOT help us with keeping wordlist.db
consistent, durable, whatever.

> Have you looked at storage engines outside of Bdb?

Yes, I have. SQLite3 is now supported by bogofilter, also with ACID
properties (again, provided that your disk drive's write cache is OFF).
It stores a simple file and, after a crash, leaves a second (journal)
file behind that is automatically recovered the next time the database
is opened.

There are also QDBM (by Stefan Bellon) and TDB (by Gyepi Sam) drivers
but the underlying databases are poorly documented, and TDB is known to
be very slow and also unmaintained.

> Were there any performance hits with going from Bdb3 to Bdb4?

There is no "going from Bdb3 to Bdb4".

There is "going from Berkeley DB Data Store to Berkeley DB Transactional
Data Store". We have no standard benchmark suite shipping as part of
bogofilter yet, but the write speed difference is, on my PC, negligible.

>> bogofilter/bogoutil play dumb and the whole bogofilter or bogoutil
>> process uses one huge transaction for almost all changes.  The
>> Atomicity trait is helpful as we need to change either all of the
>> .MSG_COUNT token and the individual tokens or none, for accuracy.
>
> So does bogofilter really need Atomicity or is it just "helpful"?

That depends. On larger trainings (training on folders that have more
than just very few messages), it is mandatory. An *occasional*
off-by-one on a fully trained database is a drop in the ocean, but
recurring off-by-one errors also call for atomicity.
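
To illustrate the all-or-nothing update described above, here is a
minimal sketch in Python using the standard sqlite3 module. It is not
bogofilter's actual code: the table name "wordlist", the column names
and the helper functions are assumptions made up for this example, and
the abort is simulated with an exception. The point is only that the
token counts and .MSG_COUNT change together or not at all.

```python
import sqlite3


def bump(conn, key):
    """Increment the counter for one key, creating it if needed."""
    conn.execute("INSERT OR IGNORE INTO wordlist(token, n) VALUES (?, 0)", (key,))
    conn.execute("UPDATE wordlist SET n = n + 1 WHERE token = ?", (key,))


def register_message(conn, tokens, fail_after=None):
    """Count one message's tokens and .MSG_COUNT in a single transaction.

    fail_after is a hypothetical test hook: raise before processing the
    token at that index, to simulate an abort in mid-update.
    """
    with conn:  # commits on success, rolls back if an exception escapes
        for i, tok in enumerate(tokens):
            if fail_after is not None and i == fail_after:
                raise RuntimeError("simulated abort in mid-update")
            bump(conn, tok)
        bump(conn, ".MSG_COUNT")


def counts(conn):
    """Return the whole (hypothetical) wordlist as a dict."""
    return dict(conn.execute("SELECT token, n FROM wordlist ORDER BY token"))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wordlist (token TEXT PRIMARY KEY, n INTEGER NOT NULL)")

register_message(conn, ["viagra", "free"])
before = counts(conn)

try:
    # "free" gets bumped inside the transaction, then the abort hits.
    register_message(conn, ["free", "offer"], fail_after=1)
except RuntimeError:
    pass

after = counts(conn)
assert before == after  # neither the tokens nor .MSG_COUNT moved
```

Without the surrounding transaction, the aborted run would have left
"free" incremented while .MSG_COUNT stayed put, which is exactly the
kind of off-by-one that accumulates over large trainings.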

> My points here I hope are taken kindly. Any Linux user is in danger of
> being a "complexity-junky"(TM) and I know I'm one myself.

Some of this is caused by Linux itself being so sloppy and poorly
documented, unfortunately.

> I have seen several software projects actually lose ground as time
> went by as the itch to make the project scope ever wider or use the
> latest version of a compiler ended up driving off the users (or
> customers in the commercial world). I don't want that to happen to
> bogofilter.

I understand that, and I considered it even more carefully when adding
the SQLite 3 driver, because SQLite 3 isn't yet widely available in
distributions, but SQLite versions older than 3.0.8 don't offer
everything bogofilter needs.

> The bottom line is I like bogofilter - I just don't see the advantage of
> moving Bdb3>4 that outweighs the problem of dealing with these log
> files.

I am looking forward to the next bogofilter current release, as this
will deal with the log files. Just updating to 0.93.6 will remove the
inactive logfiles and leave only one or a few behind.

> I'm hoping that looking at alternative storage engines might give
> bogofilter a way out?

Tuning Berkeley DB may already help so that users don't need to migrate
all their data to different storage formats.

-- 
Matthias Andree