program code [was: [cvs] bogofilter/src ...]

Mon Mar 14 06:01:00 CET 2005

On Mon, 14 Mar 2005 03:04:57 +0100
Matthias Andree wrote:

> David Relson <relson at osagesoftware.com> writes:
> 
> > There's nothing gratuitous about the strlen() call in word_new().  It's
> > part of the (unwritten) semantics of the function.  I'll be glad to
> > document the feature and make it official.
> 
> Rather not make the interface so wide, all this automation is killing
> us, and it isn't consistently used across the code either, in fact,
> there is ZERO user except the bogus macro left.

I've converted word_news to a function :-)  This eliminates the need
for extra parens in the macro (because there's no more macro!) and
reduces the number of strlen() calls when creating word_t structs to 1.

> > At the moment, rather than have more tweaks for word_t, I'd rather have
> > the new bogoutil capabilities so I can build/release 0.94.1
> 
> Wow! I've tried just to add the --db-checkpoint option, which turned out
> to be a major hassle.
> 
> I've needed to touch two files for the low-level implementation (quite
> easy), two unrelated files for the stubs, two files for the ds_*
> interface, then four files for the option, and spent half an hour
> debugging two bugs with recovery that warrant further analysis as to
> where these came from.

Yes, we have a lot of files.  We had a choice for implementing
databases and transaction vs non-transaction mode.  The choice was
between using #ifdefs and using multiple files.  We chose the latter
solution and the consequence is multiple versions of some functions in
multiple files.  

So be it!

> I must assess that we have
> 
> - simple code (option parsing, help, option definitions) spread over way
>   too many files

Yes these things are spread out.  We also build 4 major programs --
bogofilter, bogoutil,  bogolexer, and bogotune.  They share many
options, but each has some of its own.  The result is 4 sets of option
definitions, parsing routines, and help routines.  

So be it.

> - an absolutely inconcise and twisted database_db* implementation over
>   half a dozen files, partially from trying OOP in C which is just
>   painful as it cannot be articulated properly which left two groups of
>   other implementations behind, 1. sqlite, 2. qdbm + tdb

OOP _can_ be implemented in C, if one really wants to do so.  The early
C++ compilers were actually language translators and converted C++ code
to C code which was then compiled.  Consequently anything doable in C++
can be done in C, if one cares to make the effort.  AFAICT, we're
trying to use some OO constructs without including everything.  

Seems reasonable to me :-)

> - way too many #ifdef hacks all over the code

AFAICT, string "^if" appears 138 times (excluding lexer_v3.c) and three
most common symbols in these 138 statements are:

   CP866    15
   DB       49
   TEST      8

> - blurry interfaces. I don't like to overload arguments when we want in fact
>   be overloading functions, word_new() being the most recent example.
> 
>   I've created facts here now so we have a simple and well-defined
>   interface without "if"s that add complexity. The cleanup removed 15
>   lines from the word.c/word.h pair, which is a low-level data shuffling
>   function. We could split word_new to word_new as creator and word_neww
>   as copy constructor.
> 
> - too much dead code all over the place that need to be remembered and
>   touched for trivial changes, switch() constructs in option parsing for
>   instance
> 
> - still too much code duplication down deep, with dbe_cleanup_lite
>   vs. dbx_common_close,
> 
> - come close to the point where the code is unmaintainable.

We used to have a spam filter with fairly simple database code (and a
need to backup the database file in case of trouble).  Now we have a
spam filter whose 2nd biggest component is database support.  

The fact that it _can_ support 4 distinct database systems (BerkeleyDB,
tdb, qdbm, and sqlite) indicates that _something_ in the datastore
architecture is done right.  Of course it could also be argued that
supporting more than 1 database backend is going off on a tangent.

> No matter what is on our schedule, we _MUST_ clean up, and the key
> points to clean up are:
> 
> - option parsing. This must be in once central location per
>   executable. M_* mode sets here, O_DB_* sets in half a dozen files,
>   with callbacks - awful.
> 
>   We want something that keeps --help output, .xml for the man pages,
>   the actual option and the parsing all in one place. I'll look at
>   AutoOpts et al. next week to see if it can help us
> 
> - database backends. we need to kill one layer of indirection as it's
>   only ballast without serving a real purpose, datastore.? will have to
>   disappear before 0.96, and the whole probe/init stuff will need to be
>   redone, streamlined. It has too little power in its current shape.

datastore.? vs datastore_db.? provides an insulating layer such that
bogofilter's core code doesn't need worry about database underpinnings.

Whether the cleanup is necessary or merely desirable is debatable.
Whether it gets done for 0.9x or for 1.x is subject for future debate.

> - database responsibilities. I'm not clear what I mean here, I dislike
>   the idea that we hack around ifdefs, fTransaction and such, rather
>   than just keeping a state variable (struct perhaps) that has
>   information on what the back-end can do, and who's responsible for
>   cleaning up, resizing things and so on.
> 
> - layers. option parsing will move out of data* functions for sure.

Transaction option parsing belongs in transaction oriented files.  The
option code in datastore_db_trans.c can be moved into its own file,
(trans_opts.c or similar).  Not a big deal.

> - perhaps stubs. a stub library to mask the missing data* functions,
>   rather than having one set of stubs in sqlite, one for TDB/QDBM, might
>   be easier. I started such but it became blurry when sqlite3 was added
>   and the DB-TXN+DB-NOTXN-unification took place.
>
> - dead or optional code. I've never liked the idea of serving complaints
>   about 1,000 bytes in code size with #ifdef'ing code out. It is
>   miniscule compared to the lexer, and causes maintenance headaches, and
>   diversity for support, which luckily hasn't struck ourselves too hard,
>   but diversity is awful since it clutters documentation and support
>   mails with too many ifs and buts, which became somewhat obvious when I
>   rewrote doc/README.db last week, which I've done without spending more
>   than a subordinate clause on the option to disable TXN or non-TXN
>   modes at compile-time.

Some quick arithmetic indicates that lexer_v3.l generates approx 8K
code and 40K data.

Disabling transactions saves 10K (of 140K for bogofilter and 155K for
bogoutil).

10K vs 48K puts transactions at approx 20% of parsing.  I've not
calculated a total size for BerkeleyDB support.

There will always be ifs and buts in support queries.  Bogofilter runs
on multiple platforms (of hardware and software flavors) and offers
multiple modes of use -- single user, multiple users with shared or
individual wordlists, etc.

We live in a complicated world where 1 size doesn't fit all -- unless
you do some customizing.
_______________________________________________
Bogofilter-dev mailing list
Bogofilter-dev at bogofilter.org
http://www.bogofilter.org/mailman/listinfo/bogofilter-dev