bogotune broken? Larger data* revision committed to CVS.

David Relson relson at osagesoftware.com
Sat Nov 20 14:39:42 CET 2004


On Sat, 20 Nov 2004 05:42:34 +0100
Matthias Andree wrote:

> Hi,
> 
> I have committed the TXN cleanup to CVS now. Before the commit, I laid
> down the tag "before-large-txn-fix" so we have a known good starting
> point.

You've been busy!  The speed results sound very encouraging.  Good job!

I'll take a look at the mailbox scoring problem (crashing in rstats.c)
and at the bogotune problem with creating msg-count files.  Hopefully,
it's minor and not a major breakage of bogotune.

...[snip]...

> Major annoyances during the reshuffling of the code were, in random
> order:
> 
> - multiple wordlists - this stuff is a flawed concept.
> 
>   Multiple wordlists cannot work system-wide anyhow because the reader
>   must either accept dirty and sometimes failing reads or be able to
>   lock the list - which opens up a can of denial-of-service worms.
> 
>   I'd like to kill this code without replacement. I don't know if it
>   works beyond the test suite, which is incomplete by nature.
> 
>   Ignore lists that are lying in the same directory are fine though,
>   the precondition is "same BDB environment".

I maintain that multiple wordlists are of value.  Requiring the "same
BDB environment" doesn't affect me personally.  However, one of the
goals of multiple wordlists is to enable "user" and "system" lists,
which _can't_ be in the same directory (though I suppose that a
common environment would be possible).


> - global variables, such as dbe, bogohome, word_lists, message counts,
>   robs, robx - the references are hard to chase, and there is no
>   technical reason to use these. (dbe is now gone, message counts
>   halfway).

While there are well-known issues with global variables, moving them all
into parameters (or a structure that is passed to everyone) isn't a
whole lot better.  AFAICT it just changes the appearance of the code,
not the maintainability.

As the globals tend to have primary users, one idea is to declare them
in the appropriate file and implement accessor functions.  It might be
better, or it might not be.

> - forward declarations of static functions. Maintenance headache
> because
>   things need to be changed in two places without good reason. These
>   are only justified for one in a pair of mutually recursive
>   functions.

As you say, declaring them isn't a necessity (except for the one case). 
Not declaring them puts restriction on their placement in the
source code.  Declaring them gives a quick summary of what's in
the file and allows them to be ordered rationally, i.e.
alphabetically.

> - atexit() handlers - they always run when they aren't supposed to,
>   causing bogus errors, wasting maintainer time.

The handler was needed to ensure the database gets closed.  If that
assurance is no longer necessary, it can be dropped.

...[snip]...

> BTW, olddb takes the same time, also 6 s. No more performance
> advantage without transactions for scoring. And no, I'm not going to
> benchmark -u- it is the main cause for fast (almost explosively)
> growing logs and slower operation.

There are those of us who use 'u' and find it valuable.  Given all the
hard work you've done to enable concurrent reads and writes, I'd think
you'd like this feature since it exercises the new locking
capability.

With luck, I'll be able to look at some of these issues this
weekend.  Unfortunately, I took some time off work last week and
have some code that needs to work for Monday.

Regards,

David



More information about the bogofilter-dev mailing list