option proliferation

David Relson relson at osagesoftware.com
Fri Aug 1 18:51:12 CEST 2003


Matthias,


At 12:15 PM 8/1/03, Matthias Andree wrote:
>David Relson <relson at osagesoftware.com> writes:
>
> > The initial implementation used the Graham algorithm.  When the
> > Robinson-GM algorithm was added some people were pleased and switched
> > over happily.  Others stayed with the Graham algorithm.  The decision to
> > support both algorithms was made and resulted in configuration options
> > (to lessen executable size for people who only wanted one algorithm) and
> > command line switches (to specify the algorithm at run time).
> >
> > The implementation of the Robinson-Fisher algorithm occurred after that.
> > Again bogofilter produced better scoring results.  Again people didn't
> > want to change.  Again more options to allow selecting from 3 algorithms
> > and to allow two-state scoring (ham/spam) or three-state scoring
> > (ham/spam/unsure).
>
>To avoid talking past each other, these are for the major part no
>options that are there for option's sake. The choice of algorithm is a
>technical one (for example, Graham's algorithm counting a token up to
>four times, with the Robinson and Robinson-Fisher algorithms counting it
>once, "present or not"). The "Unsure" classification is a border case,
>unsure is often folded into ham, that's it.

"Unsure" is very useful, leastways for me.  It's where bogofilter puts all 
the messages that it can't _for_sure_ classify.  About 10 days ago I 
checked message counts for July and found that I had approx 2300 spam, 3500 
ham, and 125 unsure.  Of the unsure, about 75 were spam and 50 were 
ham.  Knowing which messages are unclear (to bogofilter) is very useful in 
training bogofilter to do better.  I consider the "Unsure" group to be the 
most important group for training and for maintaining the high quality of 
bogofilter's classifications.

>Convenience options, like options about the input format with unclear
>usefulness, options that let the user configure the output format in a
>machine-readable format, such things must go IMHO. Anything that can
>cause confusion must go.

The three input formats - mailbox, maildir, and message count - are all 
useful, each in its own way.  Output formatting, while not absolutely 
necessary, helps make the results useful for the many environments the 
program is used in.

> > Result:  bogofilter has a variety of ways of handling many different
> > situations and it _does_ complicate the code.  We _could_ reduce the
> > number of options by keeping the best features, i.e. hard-wiring some
> > (many?) of the defaults and deleting the other code.  Making big changes
> > like this would force many users to change their usage of bogofilter,
> > something I don't think is necessary or worthwhile.
>
>It will become necessary as the code grows further. It is about time to
>get a broom and sweep the hut clean.
>
> > When I queried the list about this, everyone else liked the idea.  You
> > were against it, but didn't offer an alternative (AFAICT).  We _could_
> > change it as you suggest.  Alternatively, if we need additional exit
> > codes in the future we can make them higher, i.e. 4,  5, etc.
>
>The point is, 0/1/2 are established. Now we're overloading an existing
>"ERROR" condition with a "SUCCESS, STATUS IS UNKNOWN" meaning. What for?
>Why do we ask users to change their scripts?
>
> >>WE NEED NO OPTIONS TO CONFIGURE IF SOMEONE WANTS Y/N/? OR S/H/U OR 1/0/?.
> >
> > 'Tis true.  They aren't necessary.  They are niceties and _are_ used.
>
>Frankly, I don't care too much if they are used. This stuff must be
>tested and supported. Many alternative code paths complicate the
>bogofilter manual and make it harder to see what happens, in bogofilter,
>in the user's integration scripts. If the exitcode interface doesn't
>offer what the user needs, we need a different interface, rather than
>more options.

Exit codes are useful to scripts that run bogofilter and have access to the 
codes.  Bogofilter's output also needs to be text based so MUAs and 
filters can use the information.

>We will get into deep trouble as we have more of these options. If the
>user's script needs too many "if"s or needs to set up too many parts of
>the environment for bogofilter, it's going to fall totally apart in
>terms of support.

A sysadmin decides how he/she wants bogofilter to work in his/her 
environment.  The supporting scripts are changed to fit those 
decisions.  Once set up, the environment is quite stable and 
unchanging.  There's no need for many "if"s to support possibly different 
options, because an environment doesn't use many different options.

...[snip]...

>If someone cared to explain why we need configurable exit codes beyond a
>switch "exit code is no spam indication" (in passthrough mode), or why
>"2 means error" is wrong, I might understand the problem. I simply don't
>see why we don't map "Unsure" to 3, or why we don't REQUIRE that a user
>uses a more powerful interface than just exit codes.

As I use the term "exit codes" it refers to the value passed to 
exit().  The codes _used_ to be 0,1,2 and have been changed to 
0,1,2,3.  They are not user changeable (except by editing the source code 
and rebuilding).
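
For illustration only, here is a minimal sketch of how a wrapper script 
might read the four exit codes.  It assumes the current mapping is 0 = spam, 
1 = ham, 2 = unsure, 3 = error, which is how this thread describes it 
(unsure took over the old error value, and error is now the biggest).  The 
names Verdict, verdict_from_exit_code and classify are hypothetical and are 
not part of bogofilter.

```python
import subprocess
from enum import Enum


class Verdict(Enum):
    # Assumed mapping, inferred from this thread; check your bogofilter version.
    SPAM = 0
    HAM = 1
    UNSURE = 2
    ERROR = 3


def verdict_from_exit_code(code: int) -> Verdict:
    """Map a process exit status to a Verdict; anything unexpected is an error."""
    try:
        return Verdict(code)
    except ValueError:
        # Covers unknown codes and negative values (process killed by a signal).
        return Verdict.ERROR


def classify(message: bytes, command=("bogofilter",)) -> Verdict:
    """Feed one message on stdin to the filter command and interpret its exit code."""
    try:
        result = subprocess.run(
            list(command),
            input=message,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        # The command could not be started at all (e.g. not installed).
        return Verdict.ERROR
    return verdict_from_exit_code(result.returncode)
```

A script written this way has one branch per verdict, and a site that folds 
unsure into ham only needs to treat Verdict.UNSURE like Verdict.HAM.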

"Y/N/?", "S/H/U", and "1/0/?" are alphabetic spamicity tags and can be 
changed by a config file option.
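
To make the distinction from exit codes concrete, here is a small sketch of 
the three tag sets as plain lookup tables.  It assumes each set is listed in 
the order spam, ham, unsure; the names TAG_SETS, STATES and tag_for are 
hypothetical, and the sketch does not reproduce bogofilter's actual config 
syntax.

```python
# Classification states, in the order the tag sets are assumed to follow.
STATES = ("spam", "ham", "unsure")

# The three tag sets mentioned above, keyed by how they are written in this mail.
TAG_SETS = {
    "Y/N/?": ("Y", "N", "?"),
    "S/H/U": ("S", "H", "U"),
    "1/0/?": ("1", "0", "?"),
}


def tag_for(state: str, tag_set: str = "Y/N/?") -> str:
    """Return the text tag for a classification state in the chosen tag set."""
    tags = dict(zip(STATES, TAG_SETS[tag_set]))
    return tags[state]
```

The tag only changes what is written as text for MUAs and filters; the 
value passed to exit() is unaffected by which set is chosen.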

>Well, what I meant here is that we should move any message delimiter
>recognition out of the lexer, into a separate module. Ultimately, the
>lexer is given one mail and parses it. What would be good to have is a
>ratio of users running bogofilter in passthrough mode compared to
>bogofilter in exit-code-only mode.

Doing as you suggest _would_ definitely simplify matters.  Go for it.

> > It seems like we already have this in the build process.
> > libbogofilter.a is built and then linked into the various executables -
> > bogofilter, bogolexer, and bogoutil.
>
>Not quite, libbogofilter.a is a maintainer convenience. We don't have to
>state which program uses which module, the linker uses the needed parts
>and that's it.
>
>We don't have a "read 18364 bytes from position 0x36fdaf8 and return the
>spamicity" or "register this message as spam" library interface yet.
>
>I agree we're not far away from that.

Would you care to take on designing the new API?

> > Fair enough.  Please define "solved" so we know what target we're
> > aiming for.
>
>"solved" := exit codes 0, 1 and 2 retain the meanings they've had in
>bogofilter 0.13.7.x and unsure becomes some other code (I don't care if
>it's 3 or some other figure from 4 to 99).

If people want them changed, I'm willing to swap error and unsure.  The 
particular values don't really matter, though I like having error as the 
biggest.

David




