option proliferation

Fri Aug 1 18:15:39 CEST 2003

David Relson <relson at osagesoftware.com> writes:

> The initial implementation used the Graham algorithm.  When the
> Robinson-GM algorithm was added some people were pleased and switched
> over happily.  Others stayed with the Graham algorithm.  The decision to
> support both algorithms was made and resulted in configuration options
> (to lessen executable size for people who only wanted one algorithm) and
> command line switches (to specify the algorithm at run time).
>
> The implementation of the Robinson-Fisher algorithm occurred after that.
> Again bogofilter produced better scoring results.  Again people didn't
> want to change.  Again more options to allow selecting from 3 algorithms
> and to allow two-state scoring (ham/spam) or three-state scoring
> (ham/spam/unsure).

To avoid talking past each other: for the most part, these are not
options that exist for options' sake. The choice of algorithm is a
technical one (for example, Graham's algorithm counts a token up to
four times, while the Robinson and Robinson-Fisher algorithms count it
once, "present or not"). The "Unsure" classification is a border case;
unsure is often folded into ham, and that's it.
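
To make the counting difference concrete, here is a minimal Python
sketch. It is not bogofilter code: the function names and the
MAX_REPEATS constant are made up for illustration, and it only shows
"count up to four times" versus "present or not", not the scoring that
follows.

```python
from collections import Counter

MAX_REPEATS = 4  # assumption: the per-message cap of four mentioned above


def capped_counts(tokens, cap=MAX_REPEATS):
    """Count each token, but never more than `cap` times per message."""
    return {tok: min(n, cap) for tok, n in Counter(tokens).items()}


def presence_counts(tokens):
    """Count each token once per message: present or not."""
    return {tok: 1 for tok in set(tokens)}
```

With six occurrences of "free" in one message, the first function
records 4 and the second records 1.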

Convenience options, such as input-format options of unclear
usefulness or options that let the user configure the output in a
machine-readable format, must go IMHO. Anything that can
cause confusion must go.

> Result:  bogofilter has a variety of ways of handling many different
> situations and it _does_ complicate the code.  We _could_ reduce the
> number of options by keeping the best features, i.e. hard-wiring some
> (many?) of the defaults and deleting the other code.  Making big changes
> like this would force many users to change their usage of bogofilter,
> something I don't think is necessary or worthwhile.

It will become necessary as the code grows further. It is about time to
get a broom and sweep the hut clean.

> When I queried the list about this, everyone else liked the idea.  You
> were against it, but didn't offer an alternative (AFAICT).  We _could_
> change it as you suggest.  Alternatively, if we need additional exit
> codes in the future we can make them higher, i.e. 4,  5, etc.

The point is, 0/1/2 are established. Now we're overloading an existing
"ERROR" condition with a "SUCCESS, STATUS IS UNKNOWN" meaning. What for?
Why do we ask users to change their scripts?

>>WE NEED NO OPTIONS TO CONFIGURE IF SOMEONE WANTS Y/N/? OR S/H/U OR 1/0/?.
>
> 'Tis true.  They aren't necessary.  They are niceties and _are_ used.

Frankly, I don't care too much if they are used. This stuff must be
tested and supported. Many alternative code paths complicate the
bogofilter manual and make it harder to see what happens, both in
bogofilter and in the user's integration scripts. If the exit code
interface doesn't offer what the user needs, we need a different
interface, rather than more options.

We will get into deep trouble as we add more of these options. If the
user's script needs too many "if"s or needs to set up too many parts of
the environment for bogofilter, it's going to fall totally apart in
terms of support.

>>All these options, particularly if changing established behaviour, make
>>supporting the software difficult and prone to failure. This violates
>>the simplest principle: keep it simple, stupid.
>
> As we've learned more about processing email, we've learned that
> established behavior can be wrong.  Dropping the old way and supporting
> only the new way breaks more than it fixes, I do think.

If someone cared to explain why we need configurable exit codes beyond a
switch "exit code is no spam indication" (in passthrough mode), or why
"2 means error" is wrong, I might understand the problem. I simply don't
see why we don't map "Unsure" to 3, or why we don't REQUIRE that a user
uses a more powerful interface than just exit codes.

(procmail arguments don't count: procmail supports backticks, and
most procmailrc files I've seen are broken anyway.)

> You're overlooking multiple speed enhancements, particularly the
> combined wordlist.  There's also been profiling work that has resulted
> in rewriting multiple code "hot spots".

This hasn't been addressed by any of my mails in these threads, and I
don't object to the "combined" list. What I dislike is that both DB and
TDB make assumptions about the record ("value") structure, but I still
don't care because that's a decision that's made once (at install or
first-use time) and then ignored for the most part.

>>There is really no need for bogofilter to understand different mailbox
>>formats, this is something a wrapper can do that we will ship.
>
> True, it's not _necessary_.  It _is_ valuable to some of our big
> users.

Well, what I meant here is that we should move any message delimiter
recognition out of the lexer, into a separate module. Ultimately, the
lexer is given one mail and parses it. What would be good to have is a
ratio of users running bogofilter in passthrough mode compared to
bogofilter in exit-code-only mode.
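
Roughly what I mean by a separate module, as a Python sketch rather
than actual bogofilter code. split_mbox and count_tokens are
hypothetical names, and the delimiter test is simplified: any line
starting with "From " opens a new message, without checking for the
preceding blank line that a careful mbox reader would require.

```python
def split_mbox(lines):
    """Yield each message of an mbox stream as a list of lines."""
    current = []
    for line in lines:
        # Simplified delimiter check: "From " at the start of a line.
        if line.startswith("From ") and current:
            yield current
            current = []
        current.append(line)
    if current:
        yield current


def count_tokens(message_lines):
    """Stand-in for the lexer: it only ever sees one message."""
    return sum(len(line.split()) for line in message_lines)
```

The splitter knows about mailbox formats and the lexer does not, so a
new format means a new splitter and no change to the lexer.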

> It seems like we already have this in the build process.
> libbogofilter.a is built and then linked into the various executables -
> bogofilter, bogolexer, and bogoutil.

Not quite: libbogofilter.a is a maintainer convenience. We don't have to
state which program uses which module; the linker pulls in the needed
parts and that's it.

We don't have a "read 18364 bytes from position 0x36fdaf8 and return the
spamicity" or "register this message as spam" library interface yet.
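
To show the shape of such an interface, here is a toy Python sketch.
Everything in it is hypothetical: the class name, the method names and
above all the scoring, which is a placeholder (share of tokens already
seen in spam) and has nothing to do with bogofilter's algorithms.

```python
class ToyFilter:
    """Hypothetical library interface; the scoring is a placeholder."""

    def __init__(self):
        self.spam_tokens = set()

    def register_spam(self, message: bytes) -> None:
        """Register this message as spam."""
        self.spam_tokens.update(message.split())

    def spamicity(self, buffer: bytes, offset: int, length: int) -> float:
        """Score `length` bytes of `buffer` starting at `offset`."""
        tokens = buffer[offset:offset + length].split()
        if not tokens:
            return 0.0
        hits = sum(1 for tok in tokens if tok in self.spam_tokens)
        return hits / len(tokens)
```

The caller hands over bytes and gets a number back; no command line
options, exit codes or mailbox formats are involved.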

I agree we're not far away from that.

> Fair enough.  Please define "solved" so we know what target we're aiming for.

"solved" := exit codes 0, 1 and 2 retain the meanings they've had in
bogofilter 0.13.7.x and unsure becomes some other code (I don't care if
it's 3 or some other figure from 4 to 99).
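
For a wrapper script, that definition could look like the following
Python sketch. The table and the function name are illustrative only;
0 = spam and 1 = non-spam are my reading of the established codes, and
3 for unsure is merely one of the possible choices.

```python
EXIT_MEANINGS = {
    0: "spam",    # assumption: established meaning of 0
    1: "ham",     # assumption: established meaning of 1
    2: "error",   # "2 means error", kept as it is
    3: "unsure",  # proposed; any other unused code would do as well
}


def classify_exit(code):
    """Map an exit code to a label; unknown codes count as errors."""
    return EXIT_MEANINGS.get(code, "error")
```

A script written against 0/1/2 keeps working unchanged, and only
scripts that care about "unsure" need to look at the new code.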

-- 
Matthias Andree