bogotune broken? Larger data* revision committed to CVS.

Matthias Andree matthias.andree at gmx.de
Sat Nov 20 05:42:34 CET 2004


Hi,

I have committed the TXN cleanup to CVS now. Before the commit, I laid
down the tag "before-large-txn-fix" so we have a known good starting
point.

It required some major API shuffling, and I'm not yet sure if I'm happy
with it the way it is now.

For some reason, long-running but still active processes are
considered crashed and abort. I'm not sure at this time if db_lock.c is
looking at the right file.

Also, bogofilter -uMB ~/Mail/lk -vvvxd prints the stats and the
X-Bogosity line, then crashes in rstats_print/rstats.c somewhere around
line 132 (no exact figure at hand, gcc -O3 here), dereferencing address
0x0 for a 4-byte read (one processor word). I'm not sure yet why, but
it makes -xd useless for now.

Backtrace:

#0  rstats_print (unsure=false) at ../../src/rstats.c:132
#1  0x0805508e in write_message (status=RC_UNSURE)
    at ../../src/passthrough.c:272
#2  0x08049f3b in bogofilter (argc=0, argv=0x0) at ../../src/bogofilter.c:127
#3  0x0804c2ef in bogomain (argc=1, argv=0xbffff0b0) at ../../src/bogomain.c:62
#4  0x0804a1ae in main (argc=0, argv=0x0) at ../../src/main.c:25


Major annoyances during the reshuffling of the code were, in random
order:

- multiple wordlists - this stuff is a flawed concept.

  Multiple wordlists cannot work system-wide anyhow because the reader
  must either accept dirty and sometimes failing reads or be able to
  lock the list - which opens up a can of denial-of-service worms.

  I'd like to kill this code without replacement. I don't know if it
  works beyond the test suite, which is incomplete by nature.

  Ignore lists that live in the same directory are fine, though; the
  precondition is "same BDB environment".

- global variables, such as dbe, bogohome, word_lists, message counts,
  robs, robx - the references are hard to chase, and there is no
  technical reason to use these. (dbe is now gone, message counts
  halfway).

- forward declarations of static functions. Maintenance headache because
  things need to be changed in two places without good reason. These are
  only justified for one in a pair of mutually recursive functions.

- atexit() handlers - they always run when they aren't supposed to,
  causing bogus errors, wasting maintainer time.
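
To illustrate the forward-declaration point above, here is a minimal C
sketch (not bogofilter code; is_even, is_odd and parity_selftest are
made-up names) of the one case where a forward declaration of a static
function cannot be avoided: a mutually recursive pair. Only one of the
two functions needs a prototype; everything else can simply be defined
before its first use.

```c
#include <assert.h>
#include <stdbool.h>

/* The single forward declaration that is actually required: is_even()
 * calls is_odd() before is_odd() has been defined. */
static bool is_odd(unsigned n);

static bool is_even(unsigned n)
{
    return n == 0 ? true : is_odd(n - 1);
}

/* Defined after its first use, hence the prototype above. */
static bool is_odd(unsigned n)
{
    return n == 0 ? false : is_even(n - 1);
}

/* Small usage example (hypothetical helper, for illustration only). */
void parity_selftest(void)
{
    assert(is_even(10));
    assert(is_odd(7));
    assert(!is_odd(0));
}
```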

The multiple-wordlist code now has an environment cache hooked up. For
every "wordlist" that is opened, we check our environment cache for an
environment matching the dirname of the wordlist. If there is none, we
open one, open a transaction right away, and read the message counts
for the wordlist. close_wordlists() is responsible for closing the
transactions, databases, and environments.
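
The lookup described above could be sketched roughly as follows. This
is not the committed code: struct env_entry, struct env_cache,
env_for_wordlist() and env_cache_close_all() are hypothetical names,
and the "environment" is only a directory name plus a flag standing in
for the real BDB environment and its open transaction. The cache is
passed in explicitly rather than kept in a global.

```c
#include <assert.h>
#include <libgen.h>   /* POSIX dirname() */
#include <stdio.h>
#include <string.h>

#define MAX_ENVS 8
#define MAX_PATH_LEN 256

/* Hypothetical stand-in for one BDB environment and its transaction. */
struct env_entry {
    char dir[MAX_PATH_LEN];
    int  txn_open;
};

struct env_cache {
    struct env_entry entries[MAX_ENVS];
    size_t count;
};

/* Return the entry for the directory containing wordlist_path, creating
 * it on first use.  Returns NULL if the path is too long or the cache
 * is full. */
struct env_entry *env_for_wordlist(struct env_cache *cache,
                                   const char *wordlist_path)
{
    char buf[MAX_PATH_LEN];
    const char *dir;
    size_t i;

    assert(cache != NULL && wordlist_path != NULL);
    if (strlen(wordlist_path) >= sizeof buf)
        return NULL;
    strcpy(buf, wordlist_path);     /* dirname() may modify its argument */
    dir = dirname(buf);

    for (i = 0; i < cache->count; i++)
        if (strcmp(cache->entries[i].dir, dir) == 0)
            return &cache->entries[i];      /* cache hit */

    if (cache->count == MAX_ENVS)
        return NULL;

    /* Cache miss: "open" the environment and a transaction right away. */
    snprintf(cache->entries[cache->count].dir, MAX_PATH_LEN, "%s", dir);
    cache->entries[cache->count].txn_open = 1;
    return &cache->entries[cache->count++];
}

/* Plays the role close_wordlists() has in the description above. */
void env_cache_close_all(struct env_cache *cache)
{
    size_t i;

    for (i = 0; i < cache->count; i++)
        cache->entries[i].txn_open = 0;
    cache->count = 0;
}
```

With this sketch, two lists in the same directory (say a wordlist and
an ignore list) share one entry, while a list in another directory gets
its own.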

make check passes 38/39 tests, not counting the bogus pass below and not
counting the crash for long-running processes above.

bogotune is broken. It causes database panics for a reason I cannot
see. What's worse, it doesn't propagate these in a way that would make
t.bulkmode report "FAIL".

Anyway, please check scoring speed again. The fact that one score = one
transaction should speed things up considerably, as we no longer have
one transaction (synchronous write) per token.
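
As a toy illustration of why this matters (not bogofilter code; struct
toy_store and all function names here are invented, and a "commit" is
just a counter standing in for one synchronous write), compare the
number of commits for the same set of tokens under both schemes:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical store that only counts what happens to it. */
struct toy_store {
    unsigned long lookups;
    unsigned long commits;   /* each commit models one synchronous write */
};

static void toy_lookup(struct toy_store *s, const char *token)
{
    assert(s != NULL);
    (void)token;             /* a real store would look the token up */
    s->lookups++;
}

static void toy_commit(struct toy_store *s)
{
    s->commits++;
}

/* Old scheme: every token lookup is its own transaction. */
void score_txn_per_token(struct toy_store *s,
                         const char *const *tokens, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        toy_lookup(s, tokens[i]);
        toy_commit(s);
    }
}

/* New scheme: all lookups for one score share a single transaction. */
void score_txn_per_message(struct toy_store *s,
                           const char *const *tokens, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        toy_lookup(s, tokens[i]);
    toy_commit(s);
}
```

For a message with n tokens, the first function commits n times and
the second once, with the same number of lookups.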

Scoring my Linux-Kernel folder in -MTB mode (mbox format, with warm
cache) took 30 s before the update and 6 s after the update, wallclock
time - and that's with a pretty fast CPU (AMD Athlon XP 2500+) and a
medium class 7200/min SCSI drive (Fujitsu MAH-3182MP). I'd think
machines with slower CPU or slower HDD will profit even more.

BTW, olddb takes the same time, also 6 s. So running without
transactions no longer has a performance advantage for scoring. And no,
I'm not going to benchmark -u - it is the main cause of fast (almost
explosively) growing logs and slower operation.

-- 
Matthias Andree


