compile time options
David Relson
relson at osagesoftware.com
Tue Sep 30 15:49:09 CEST 2003
On Tue, 30 Sep 2003 14:28:59 +0200
Boris 'pi' Piwinger <3.14 at logic.univie.ac.at> wrote:
...[snip]...
> OK, I'll start from the man page (1.15.4):
>
> > The -t (terse) option tells bogofilter to print an
> > abbreviated spamicity message containing 1 letter and the
> > score. Spam is indicated with "Y", ham by "N", and unsure
> > by "U". Note: the formatting can be customized using the
> > config file.
>
> I think this can go. -T is for machine readability and does
> what we need.
This is a "not sure". "-t" offers configurability using bogofilter's
output formatting capabilities. "-T" is invariant.
> > The -2 option tells bogofilter to binary classify the
> > message as either ham or spam, and never as unsure. When this
> > option is used with -u, a wordlist is always updated.
> >
> >
> > The -3 option tells bogofilter to use tristate
> > classification for the message, i.e. classify the message as ham,
> > spam, or unsure. This option is effective only if
> > ham_cutoff is non-zero.
>
> Those can go, the decision can be made by choosing
> appropriate cutoffs.
These can definitely be eliminated. They were added to appease people
who wanted better control over using two-state or three-state
classification. Parameters spam_cutoff and ham_cutoff are what
bogofilter needs.
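
To illustrate how the two cutoffs alone can select two-state or
three-state behaviour, here is a minimal sketch. It is not bogofilter's
code: the function name, the comparison operators (>= and <=), and the
example cutoff values are assumptions made for illustration. The only
part taken from the man page is that tristate classification is
effective only when ham_cutoff is non-zero.

```python
def classify(score, spam_cutoff, ham_cutoff):
    """Return 'spam', 'ham', or 'unsure' for a spamicity score.

    Hypothetical helper; the exact comparisons are assumptions.
    """
    if score >= spam_cutoff:
        return "spam"
    if ham_cutoff == 0.0:
        # No ham cutoff configured: two-state mode, the rest is ham.
        return "ham"
    if score <= ham_cutoff:
        return "ham"
    # Between the two cutoffs: three-state mode reports unsure.
    return "unsure"


if __name__ == "__main__":
    # Same score, two-state versus three-state configuration.
    print(classify(0.5, spam_cutoff=0.9, ham_cutoff=0.0))
    print(classify(0.5, spam_cutoff=0.9, ham_cutoff=0.1))
```

With ham_cutoff left at zero, the middle range falls through to ham,
which is what -2 was asking for; a non-zero ham_cutoff gives what -3
was asking for.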
> > When reading mbox format, bogofilter relies on the empty
> > line after a mail.
>
> BTW: We should mention formail -es here which fixes this in
> mboxes.
>
> > The -Bfilename (bulk mode) option tells bogofilter to
> > classify multiple objects (see the previous paragraph)
>
> Do we need both -b and -B? Isn't one enough?
Using stdin for the file list is necessary because the command line has
length limits. stdin allows uses like "ls dir1/a* dir2/b* |
bogofilter". The command line is also useful. For example "bogofilter
-B dir1 dir2" is cleaner than "echo dir1 dir2 | bogofilter -b".
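
A rough sketch of the idea, with both sources feeding the same list of
objects to classify. This is not how bogofilter implements -b and -B;
the function name is hypothetical, and splitting stdin on whitespace is
an assumption chosen so that both the "ls" and the "echo" examples
above would work.

```python
import sys


def object_names(argv_names, lines):
    """Collect names from the command line, else from input lines.

    Hypothetical helper: 'lines' is any iterable of text lines, such
    as sys.stdin. Names are assumed to be whitespace-separated.
    """
    if argv_names:
        # Names given on the command line (the -B style).
        return list(argv_names)
    names = []
    for line in lines:
        # One or more names per line (the -b style).
        names.extend(line.split())
    return names


if __name__ == "__main__":
    for name in object_names(sys.argv[1:], sys.stdin):
        print(name)
```

The stdin path is only read when no names are given as arguments, so it
is not subject to the command line length limit.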
> > The -F (force) ignores threshold values when printing
> > spamicity statistics.
>
> I don't understand this one, which makes me feel it is not
> needed;-)
When checking to see _why_ bogofilter has done something unexpected,
this is useful.
> > The -d dir option allows you to set the directory under
> > which the wordlists will be found to dir. If omitted, the
> > default directory will be $BOGOFILTER_DIR if
> > BOGOFILTER_DIR is set and $HOME/.bogofilter otherwise.
>
> Is that correct? Doesn't the config file come in here?
> Anyhow, this is explained later. So "If omitted ..." should
> be deleted here.
The "default" directory is determined by the environment variables. It
can be overridden using the config file or the command line. Perhaps
this wording can be clarified.
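
One way to state the lookup order is as a short sketch. The function
and parameter names are hypothetical, and having the command line win
over the config file is an assumption (the text above only says that
either one overrides the default). The environment fallback follows the
ENVIRONMENT section of the man page.

```python
import os


def wordlist_dir(cli_dir=None, config_dir=None, environ=None):
    """Pick the wordlist directory from the available settings.

    Hypothetical helper. Assumed precedence: -d on the command line,
    then the config file, then $BOGOFILTER_DIR, then $HOME/.bogofilter.
    """
    env = os.environ if environ is None else environ
    if cli_dir:
        return cli_dir
    if config_dir:
        return config_dir
    if env.get("BOGOFILTER_DIR"):
        return env["BOGOFILTER_DIR"]
    if env.get("HOME"):
        return env["HOME"] + "/.bogofilter"
    # Neither variable is set, so -d must be present.
    raise ValueError("no wordlist directory: set BOGOFILTER_DIR or HOME, or use -d")


if __name__ == "__main__":
    print(wordlist_dir())
```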
> > The -k tag option sets the cache size for the BerkeleyDB
> > subsystem. Properly sizing the cache improves bogofilter's
> > performance. Run the bogotune script to determine the
> > recommended size.
>
> Enough if only in config file.
Many of the command line options also have config file options. It's a
matter of style and preference as to which one (command line or config
file) is used.
> > The -L tag option configures a tag which can be included
> > in the information being logged by the -l option, but it
> > requires a custom format that includes the %l string for
> > now. This option implies -l.
>
> Enough if only in config file.
Likely so.
>
> > The -I filename option tells bogofilter to read its input
> > from the specified file, rather than from stdin
>
> I cannot see a situation where we could not read from stdin.
> So this would be superfluous.
-I and -O are useful when tracing scripts and debugging the code.
> > The -O filename option tells bogofilter where to write its
> > output in passthrough mode. Note that this only works when
> > -p is explicitly given.
>
> Why not capture this from stdout? So this could also go.
>
> > The -W option tells bogofilter to operate with a single
> > wordlist, named wordlist.db. Each token in wordlist.db is
> > stored as an ASCII string with two counts (for spam and
> > ham) and (optionally) a timestamp.
> >
> >
> > The -WW option tells bogofilter to operate with a pair of
> > wordlists, named spamlist.db and goodlist.db. Spamlist.db
> > stores tokens, counts, and timestamps for tokens from spam
> > messages. Goodlist.db stores tokens, counts, and
> > timestamps for tokens from ham messages.
>
> I think those can go. Either we drop the two lists
> completely or you can set it in the config file.
>
> > The -O filename option tells bogofilter where to write its
> > output in passthrough mode. Note that this only works when
> > -p is explicitly given.
>
> We had that before. Needs to be fixed in the man page.
>
> > The -g option selects the original Graham form of the
> > calculation method.
> >
> > The -r option selects the Robinson modifications to the
> > calculation method.
> >
> > The -f option selects the Robinson-Fisher modifications to
> > the calculation method.
>
> Those can go, config file is enough.
>
> > Bogofilter has three special parsing options which can be
> > enabled (or disabled) at the user's discretion. The
> > options are of form -Px and -PX where x designates an
> > option letter. For the parsing options, a lower case
> > letter enables the option and an upper case letter disables
> > it.
>
> I think they can all go completely. Let's fix the defaults.
Agreed.
> > The -m [value][,value][,value] option allows setting the
> > min_dev value and, optionally, the robs and robx values.
>
> > The -o [value][,value] option allows setting the
> > spam_cutoff value and, optionally, the ham_cutoff value.
>
> Useful for testing, but it could be done using the -c
> switch. I'd leave them in.
Many of these command line options are used in the regression tests
("make check"). Having to generate config files for -W, -WW, -k, -m,
-o, etc would be a pain in the butt and would make the test scripts much
bigger and harder to
> > Option -y date specifies the date to give to tokens that
> > don't have dates.
>
> Is that relevant for bogofilter? Or should that be bogoutil?
Can be used to turn off timestamps, thus saving database size.
>
> > ENVIRONMENT
> > Bogofilter will initialize its data base directory to
> > $BOGOFILTER_DIR if BOGOFILTER_DIR is set. If it is not
> > set, bogofilter will use $HOME/.bogofilter instead. If
> > neither BOGOFILTER_DIR nor HOME is set, the -d dir option
> > must be present.
>
> With the combined wordlist, we only have one file in that
> directory. So it would be good enough to name the file directly.
Maybe. Need to think about it ...