option proliferation
David Relson
relson at osagesoftware.com
Fri Aug 1 18:51:12 CEST 2003
Matthias,
At 12:15 PM 8/1/03, Matthias Andree wrote:
>David Relson <relson at osagesoftware.com> writes:
>
> > The initial implementation used the Graham algorithm. When the
> > Robinson-GM algorithm was added some people were pleased and switched
> > over happily. Others stayed with the Graham algorithm. The decision to
> > support both algorithms was made and resulted in configuration options
> > (to lessen executable size for people who only wanted one algorithm) and
> > command line switches (to specify the algorithm at run time).
> >
> > The implementation of the Robinson-Fisher algorithm occurred after that.
> > Again bogofilter produced better scoring results. Again people didn't
> > want to change. Again more options to allow selecting from 3 algorithms
> > and to allow two-state scoring (ham/spam) or three-state scoring
> > (ham/spam/unsure).
>
>To avoid talking past each other, these are for the major part no
>options that are there for option's sake. The choice of algorithm is a
>technical one (for example, Graham's algorithm counting a token up to
>four times, with the Robinson and Robinson-Fisher algorithms counting it
>once, "present or not"). The "Unsure" classification is a border case,
>unsure is often folded into ham, that's it.
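For anyone following along, the counting difference described above can be sketched roughly as follows. This is only an illustration: the function names and sample tokens are made up, and the cap of four is taken from the description in this thread, not from the bogofilter source.

```python
from collections import Counter

MAX_REPEATS = 4  # assumption: cap taken from the description in this thread


def capped_counts(tokens, cap=MAX_REPEATS):
    """Count each token, but never more than `cap` times per message."""
    return {tok: min(n, cap) for tok, n in Counter(tokens).items()}


def presence_counts(tokens):
    """Count each token once per message: present or not."""
    return {tok: 1 for tok in set(tokens)}


# Hypothetical message: one token repeated six times, plus a few others.
tokens = ["cheap"] * 6 + ["meeting", "meeting", "lunch"]
print(capped_counts(tokens))
print(presence_counts(tokens))
```

With the capped scheme a heavily repeated token weighs more than a token seen once; with the presence scheme every distinct token weighs the same.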
"Unsure" is very useful, leastways for me. It's where bogofilter puts all
the messages that it can't _for_sure_ classify. About 10 days ago I
checked message counts for July and found that I had approx 2300 spam, 3500
ham, and 125 unsure. Of the unsure about 75 were spam and 50 were
ham. Knowing which messages are unclear (to bogofilter) is very useful in
training bogofilter to do better. I consider the "Unsure" group to be the
most important group for training and for maintaining the high quality of
bogofilter's classifications.
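To make the three-state idea concrete, here is a minimal sketch. The cutoff values and the names classify, ham_cutoff and spam_cutoff are assumptions chosen for illustration, not bogofilter's actual defaults; the message counts are the July figures quoted above.

```python
def classify(score, ham_cutoff=0.10, spam_cutoff=0.95):
    """Map a spamicity score in [0, 1] to one of three states.

    The cutoffs are hypothetical; anything between them is "unsure".
    """
    if score >= spam_cutoff:
        return "spam"
    if score <= ham_cutoff:
        return "ham"
    return "unsure"


# July figures from the paragraph above.
spam, ham, unsure = 2300, 3500, 125
unsure_share = unsure / (spam + ham + unsure)
print(f"unsure share: {unsure_share:.1%}")  # prints "unsure share: 2.1%"
```

Roughly 2% of the mail lands in the unsure group, so it is a small pile to review by hand and feed back into training.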
>Convenience options, like options about the input format with unclear
>usefulness, options that let the user configure the output format in a
>machine-readable format, such things must go IMHO. Anything that can
>cause confusion must go.
The three input formats - mailbox, maildir, and message count - are all
useful, each in its own way. Output formatting, while not absolutely
necessary, helps make the results useful for the many environments the
program is used in.
> > Result: bogofilter has a variety of ways of handling many different
> > situations and it _does_ complicate the code. We _could_ reduce the
> > number of options by keeping the best features, i.e. hard-wiring some
> > (many?) of the defaults and deleting the other code. Making big changes
> > like this would force many users to change their usage of bogofilter,
> > something I don't think is necessary or worthwhile.
>
>It will become necessary as the code grows further. It is about time to
>get a broom and sweep the hut clean.
>
> > When I queried the list about this, everyone else liked the idea. You
> > were against it, but didn't offer an alternative (AFAICT). We _could_
> > change it as you suggest. Alternatively, if we need additional exit
> > codes in the future we can make them higher, i.e. 4, 5, etc.
>
>The point is, 0/1/2 are established. Now we're overloading an existing
>"ERROR" condition with a "SUCCESS, STATUS IS UNKNOWN" meaning. What for?
>Why do we ask users to change their scripts?
>
> >>WE NEED NO OPTIONS TO CONFIGURE IF SOMEONE WANTS Y/N/? OR S/H/U OR 1/0/?.
> >
> > 'Tis true. They aren't necessary. They are niceties and _are_ used.
>
>Frankly, I don't care too much if they are used. This stuff must be
>tested and supported. Many alternative code paths complicate the
>bogofilter manual and make it harder to see what happens, in bogofilter,
>in the user's integration scripts. If the exitcode interface doesn't
>offer what the user needs, we need a different interface, rather than
>more options.
Exit codes are useful to scripts that run bogofilter and have access to the
codes. Bogofilter's output also needs to be text based so that MUAs and
filters can use the information.
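As a rough sketch of the script side, a wrapper might pipe one message to the classifier and branch on the exit code. The folder names and the functions folder_for and classify_message are hypothetical, and the code meanings (0 spam, 1 ham, 2 unsure, anything else an error) are assumptions pieced together from this thread.

```python
import subprocess

# Assumed exit-code meanings; the folder names are invented for illustration.
FOLDER_FOR_CODE = {0: "spam", 1: "inbox", 2: "unsure"}


def folder_for(returncode):
    """Pick a delivery folder from the classifier's exit code."""
    try:
        return FOLDER_FOR_CODE[returncode]
    except KeyError:
        raise RuntimeError(f"classifier failed with exit code {returncode}")


def classify_message(raw_message, command=("bogofilter",)):
    """Feed one message (bytes) to the classifier on stdin and file it."""
    result = subprocess.run(list(command), input=raw_message)
    return folder_for(result.returncode)
```

A script like this never sees the text output at all, which is why both interfaces are worth having.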
>We will get into deep trouble as we have more of these options. If the
>user's script needs too many "if"s or needs to set up too many parts of
>the environment for bogofilter, it's going to fall totally apart in
>terms of support.
A sysadmin decides how he/she wants bogofilter to work in his/her
environment. The supporting scripts are changed to fit those
decisions. Once set up, the environment is quite stable and
unchanging. There's no need for many "if"s to support possibly different
options, because an environment doesn't use many different options.
...[snip]...
>If someone cared to explain why we need configurable exit codes beyond a
>switch "exit code is no spam indication" (in passthrough mode), or why
>"2 means error" is wrong, I might understand the problem. I simply don't
>see why we don't map "Unsure" to 3, or why we don't REQUIRE that a user
>uses a more powerful interface than just exit codes.
As I use the term "exit codes" it refers to the value passed to
exit(). The codes _used_ to be 0,1,2 and have been changed to
0,1,2,3. They are not user changeable (except by editing the source code
and rebuilding).
"Y/N/?", "S/H/U", and "1/0/?" are alphabetic spamicity tags and can be
changed by a config file option.
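The distinction can be sketched like this: the exit codes are fixed values, while the tag is just a lookup into whichever set was configured. The numeric assignments below are an assumption inferred from this thread (error being the biggest), and ExitCode, TAG_SETS and spamicity_tag are made-up names, not bogofilter identifiers.

```python
from enum import IntEnum


class ExitCode(IntEnum):
    """Fixed at build time (assumed values, inferred from this thread)."""
    SPAM = 0
    HAM = 1
    UNSURE = 2
    ERROR = 3


# Configurable tag sets, each ordered spam, ham, unsure.
TAG_SETS = {
    "Y/N/?": ("Y", "N", "?"),
    "S/H/U": ("S", "H", "U"),
    "1/0/?": ("1", "0", "?"),
}


def spamicity_tag(code, tag_set="Y/N/?"):
    """Return the one-character tag for a classification result."""
    if code not in (ExitCode.SPAM, ExitCode.HAM, ExitCode.UNSURE):
        raise ValueError(f"no tag for exit code {code}")
    return TAG_SETS[tag_set][code]


print(spamicity_tag(ExitCode.UNSURE, "S/H/U"))  # prints "U"
```

Changing the tag set changes only what appears in the text output; the value passed to exit() stays the same.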
>Well, what I meant here is that we should move any message delimiter
>recognition out of the lexer, into a separate module. Ultimately, the
>lexer is given one mail and parses it. What would be good to have is a
>ratio of users running bogofilter in passthrough mode compared to
>bogofilter in exit-code-only mode.
Doing as you suggest _would_ definitely simplify matters. Go for it.
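A minimal sketch of that separation, assuming mbox-style input where a line starting with "From " begins a new message. Both split_mbox and lex are hypothetical stand-ins, and real mbox handling (blank line before the delimiter, ">From" quoting) is deliberately left out.

```python
def split_mbox(lines):
    """Yield one message at a time as a list of lines.

    Delimiter lines are consumed here, so the lexer never sees them.
    """
    message = []
    for line in lines:
        if line.startswith("From "):
            if message:
                yield message
            message = []
            continue
        message.append(line)
    if message:
        yield message


def lex(message):
    """Stand-in lexer: it is handed exactly one message and tokenizes it."""
    return [word for line in message for word in line.split()]


# Hypothetical two-message mailbox.
mbox = [
    "From a@example.com Fri Aug  1 18:51:12 2003",
    "Subject: one",
    "",
    "hello world",
    "From b@example.com Fri Aug  1 18:52:00 2003",
    "Subject: two",
    "",
    "buy now",
]
for msg in split_mbox(mbox):
    print(lex(msg))
```

A maildir reader could feed the same lex() one file at a time, so the lexer would need no knowledge of the input format.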
> > It seems like we already have this in the build process.
> > libbogofilter.a is built and then linked into the various executables -
> > bogofilter, bogolexer, and bogoutil.
>
>Not quite, libbogofilter.a is a maintainer convenience. We don't have to
>state which program uses which module, the linker uses the needed parts
>and that's it.
>
>We don't have a "read 18364 bytes from position 0x36fdaf8 and return the
>spamicity" or "register this message as spam" library interface yet.
>
>I agree we're not far away from that.
Would you care to take on designing the new API?
> > Fair enough. Please define "solved" so we know what target we're
> > aiming for.
>
>"solved" := exit codes 0, 1 and 2 retain the meanings they've had in
>bogofilter 0.13.7.x and unsure becomes some other code (I don't care if
>it's 3 or some other figure from 4 to 99).
If people want them changed, I'm willing to swap error and unsure. The
particular values don't really matter, though I like having error as the
biggest.
David