Better database??

Wed Mar 3 16:55:04 CET 2004

On Wed, 03 Mar 2004, michael at optusnet.com.au wrote:

> What I was expecting was a layer that called a datastore saying
> "Here's a token, a spam count, a good count and a date. store them".
> and "Find me the counts and date associated with this token" and
> "Here's the number of messages; Store it". etc etc.

We're moving away from that, for consistency reasons. We currently have
the problem that nothing ties the message count and the token counts
together. A design flaw in datastore_db (BerkeleyDB) has merely hidden
that consistency problem so far, because at the moment only one process
can write the data base, and it blocks all readers while it does. Other
datastore layers may or may not expose this problem; I haven't looked
into that.

The goal is to keep the data base updates atomic: "Here's a number of
bogofilter data sets, consisting of token, ham count, spam count and
date, store all of them and increment .MSG_COUNT at the same time or do
nothing at all."
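
To make the all-or-nothing idea concrete, here is a minimal sketch in C.
It is a toy in-memory store, not bogofilter's datastore code: the struct
layout, the names (token_rec, store, store_update_atomic) and the fixed
limits are assumptions for illustration only. A real datastore would
have to get the same effect from the backend's transaction mechanism.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_TOKENS    64
#define MAX_TOKEN_LEN 32

/* One data set: token, ham count, spam count, date (names assumed). */
struct token_rec {
    char     token[MAX_TOKEN_LEN];
    uint32_t ham;
    uint32_t spam;
    uint32_t date;
};

/* Toy in-memory store; msg_count stands in for .MSG_COUNT. */
struct store {
    struct token_rec recs[MAX_TOKENS];
    size_t           nrecs;
    uint32_t         msg_count;
};

/* Add counts to an existing token or append a new one; -1 on failure. */
static int apply_one(struct store *s, const struct token_rec *r)
{
    size_t i;

    /* Reject empty or unterminated tokens. */
    if (r->token[0] == '\0' || memchr(r->token, '\0', MAX_TOKEN_LEN) == NULL)
        return -1;
    for (i = 0; i < s->nrecs; i++) {
        if (strcmp(s->recs[i].token, r->token) == 0) {
            s->recs[i].ham  += r->ham;
            s->recs[i].spam += r->spam;
            s->recs[i].date  = r->date;
            return 0;
        }
    }
    if (s->nrecs == MAX_TOKENS)
        return -1;              /* store is full */
    s->recs[s->nrecs++] = *r;
    return 0;
}

/* Apply the whole batch and the message count increment, or nothing. */
int store_update_atomic(struct store *s, const struct token_rec *batch,
                        size_t n, uint32_t msgs)
{
    struct store work = *s;     /* stage all changes on a private copy */
    size_t i;

    for (i = 0; i < n; i++)
        if (apply_one(&work, &batch[i]) != 0)
            return -1;          /* *s has not been touched */
    work.msg_count += msgs;
    *s = work;                  /* the "commit" */
    return 0;
}
```

The point of the sketch is only that the caller hands over the token
records and the message count increment together, so that a failure
half-way through cannot leave the two out of step.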

> The idea here is that bogofilter knows exactly what it wants to store,

Your assumption is that this information will remain the same over time.
I'm not convinced this assumption matches reality.

> and it would be nice if the datastore was told about it. This would
> allow the datastore to use some intelligence in storing it.

I agree that casting the data model into the code will allow for
optimizations, but these come at the price of flexibility,
maintainability and extensibility.

I'd be willing to accept a data abstraction layer, one that does the
framing/extraction and tells us HOW the data is stored on disk. If that
layer is designed with extensibility in mind, that may be an advantage
for further development. For instance, the date is just a kludge that
allows purging tokens from the data base - however, the .MSG_COUNT is
only a step-child of all these maintenance functions at best. The whole
scoring and storing stuff is modeled around messages, while the token
maintenance is modeled around individual tokens without taking the
.MSG_COUNT into account at all. That's one of the most important reasons
why I have avoided the maintenance functions to date: I do not trust
that the current maintenance code does the right thing.
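
As a rough illustration of what such a framing/extraction layer could
look like, here is a sketch in C. Everything in it is assumed for the
example: the names (rec, rec_pack, rec_unpack), the leading version
byte and the little-endian layout are not bogofilter's actual on-disk
format. The version byte is one possible way to leave room for later
changes to the record.

```c
#include <stddef.h>
#include <stdint.h>

#define REC_VERSION 1
#define REC_SIZE_V1 13          /* 1 version byte + three 32-bit fields */

/* In-memory view of one token's data (field names assumed). */
struct rec {
    uint32_t ham;
    uint32_t spam;
    uint32_t date;
};

/* Store a 32-bit value in little-endian byte order. */
static void put_u32(unsigned char *p, uint32_t v)
{
    p[0] = (unsigned char)(v & 0xff);
    p[1] = (unsigned char)((v >> 8) & 0xff);
    p[2] = (unsigned char)((v >> 16) & 0xff);
    p[3] = (unsigned char)((v >> 24) & 0xff);
}

/* Read a 32-bit little-endian value. */
static uint32_t get_u32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Frame a record into buf; returns bytes written, 0 if buf is too small. */
size_t rec_pack(const struct rec *r, unsigned char *buf, size_t buflen)
{
    if (buflen < REC_SIZE_V1)
        return 0;
    buf[0] = REC_VERSION;
    put_u32(buf + 1, r->ham);
    put_u32(buf + 5, r->spam);
    put_u32(buf + 9, r->date);
    return REC_SIZE_V1;
}

/* Extract a record; returns 0 on success, -1 on short or unknown data. */
int rec_unpack(const unsigned char *buf, size_t len, struct rec *r)
{
    if (len < REC_SIZE_V1 || buf[0] != REC_VERSION)
        return -1;
    r->ham  = get_u32(buf + 1);
    r->spam = get_u32(buf + 5);
    r->date = get_u32(buf + 9);
    return 0;
}
```

With the layout confined to these two functions, the data base layer
underneath still only sees opaque bytes, and a changed record format
would mean a new version number here rather than a change there.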

> At the moment, the layer does it the other way around: It 
> presents an API to bogofilter and says "I'll store random
> things: Do with me what you will".
> 
> Does that make any sense??

Yes, it does. That way, the content stored can be changed without
touching the data base layer at all. We've already changed the data
store layer several times, for instance by introducing the modification
time of a token. I'm not using it...

> What I was looking to do was things like to use 16 bit counts for
> storing ham+spam counts; To not store the date; To store the date in
> an imprecise fashion; etc etc. Basically: To take advantage of the
> knowledge of what it is that we're trying to store.

The assumption that we'll be able to store ham + spam counts in 16-bit
integers comes out of the "arbitrary limits that are coming back to
haunt us" camp. We're already taking a risk by assuming that 32-bit
counters will suffice.
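
For a sense of scale: an unsigned 16-bit counter tops out at 65535. The
small C sketch below (function names are made up for the example) shows
the two things that can happen when a count reaches that ceiling: it
either wraps around to a small number or has to be pinned at the
maximum. Either way the stored count no longer matches what was
actually seen.

```c
#include <stdint.h>

/* Plain 16-bit addition: the result silently wraps modulo 65536. */
uint16_t add16_wrap(uint16_t count, uint16_t n)
{
    return (uint16_t)(count + n);
}

/* Saturating 16-bit addition: the result sticks at UINT16_MAX. */
uint16_t add16_sat(uint16_t count, uint16_t n)
{
    uint32_t sum = (uint32_t)count + n;

    return sum > UINT16_MAX ? UINT16_MAX : (uint16_t)sum;
}
```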

The "date" member in a token record is not crucial to bogofilter.

-- 
Matthias Andree

Encrypt your mail: my GnuPG key ID is 0x052E7D95