compile time options

David Relson relson at osagesoftware.com
Tue Sep 30 15:49:09 CEST 2003


On Tue, 30 Sep 2003 14:28:59 +0200
Boris 'pi' Piwinger <3.14 at logic.univie.ac.at> wrote:


...[snip]...

> OK, I'll start from the man page (1.15.4):
> 
> >        The -t (terse) option tells bogofilter to print an
> >        abbreviated spamicity message containing 1 letter and the
> >        score. Spam is indicated with "Y", ham by "N", and unsure
> >        by "U". Note: the formatting can be customized using the
> >        config file.
> 
> I think, this can go. -T is for machine readability and does
> what we need.

This is a "not sure".  "-t" offers configurability using bogofilter's
output formatting capabilities.  "-T" is invariant.
  

> >        The -2 option tells bogofilter to binary classify the
> >        message as either ham or spam, and never as unsure. When this
> >        option is used with -u, a wordlist is always updated.
> > 
> > 
> >        The -3 option tells bogofilter to use tristate
> >        classification for the message, i.e. classify the message as
> >        ham, spam, or unsure. This option is effective only if
> >        ham_cutoff is non-zero.
> 
> Those can go, the decision can be made by choosing
> appropriate cutoffs.

These can definitely be eliminated.  They were added to appease people
who wanted better control over using two-state or three-state
classification.  Parameters spam_cutoff and ham_cutoff are what
bogofilter needs.
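
To illustrate why the two cutoffs are sufficient, here is a rough Python sketch (not bogofilter's actual code). The function name, the example cutoff values, and the exact comparison operators are assumptions; the only behaviour taken from the man page text quoted above is that three-state classification applies only when ham_cutoff is non-zero.

```python
def classify(score, spam_cutoff, ham_cutoff=0.0):
    """Return 'Y' (spam), 'N' (ham), or 'U' (unsure) for a spamicity score.

    Hypothetical helper: the >= and <= comparisons are assumptions,
    not a statement of how bogofilter compares scores.
    """
    if score >= spam_cutoff:
        return "Y"
    # A zero ham_cutoff means there is no unsure band (two-state mode).
    if ham_cutoff == 0.0 or score <= ham_cutoff:
        return "N"
    return "U"


# Three-state: scores between the cutoffs are reported as unsure.
print(classify(0.50, spam_cutoff=0.95, ham_cutoff=0.10))
# Two-state: with ham_cutoff left at zero the same score is ham.
print(classify(0.50, spam_cutoff=0.95))
```

With this arrangement, setting ham_cutoff to zero gives what -2 asks for and a non-zero ham_cutoff gives what -3 asks for, so neither switch adds anything.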

> >        When reading mbox format, bogofilter relies on  the  empty
> >        line after a mail.
> 
> BTW: We should mention formail -es here which fixes this in
> mboxes.
> 
> >        The  -Bfilename	(bulk  mode)  option  tells bogofilter to
> >        classify multiple objects  (see	the  previous  paragraph)
> 
> Do we need both -b and -B? Isn't one enough?

Reading the file list from stdin is necessary because the command line
has length limits.  stdin allows uses like "ls dir1/a* dir2/b* |
bogofilter -b".  The command line form is also useful.  For example,
"bogofilter -B dir1 dir2" is cleaner than "echo dir1 dir2 | bogofilter -b".
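
The two ways of supplying the list can be sketched in Python as follows. This is only an illustration of the design, not bogofilter's implementation; collect_paths and its parameters are hypothetical names, and the one-name-per-line stdin format is an assumption.

```python
import io
import sys


def collect_paths(argv_paths, stream=sys.stdin):
    """Return the objects to classify.

    Names given on the command line win; otherwise read one name per
    line from the stream, which is not subject to argv length limits.
    """
    if argv_paths:
        return list(argv_paths)
    return [line.strip() for line in stream if line.strip()]


# Command-line style: a short, explicit list.
print(collect_paths(["dir1", "dir2"]))
# stdin style: an arbitrarily long list, simulated here with StringIO.
print(collect_paths([], io.StringIO("dir1/a1\ndir2/b1\n")))
```

Keeping both forms costs little: the short form reads better in scripts, and the stream form keeps working when a glob expands to thousands of names.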

> >        The  -F	(force)	 ignores  threshold  values when printing
> >        spamicity statistics.
> 
> I don't understand this one, which makes me feel it is not
> needed;-)

When checking to see _why_ bogofilter has done something unexpected,
this is useful.

> >        The -d dir option allows you to set the directory under
> >        which the wordlists will be found to dir. If omitted, the
> >        default directory will be $BOGOFILTER_DIR if
> >        BOGOFILTER_DIR is set and $HOME/.bogofilter otherwise.
> 
> Is that correct? Doesn't the config file come in here?
> Anyhow, this is explained later. So "If omitted ..." should
> be deleted here.

The "default" directory is determined by the environment variables.  It
can be overridden using the config file or the command line.  Perhaps
this wording can be clarified.
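
One way to state the lookup order unambiguously is as a small Python sketch. It is not taken from bogofilter's source: resolve_wordlist_dir and its parameters are hypothetical, and letting the command line win over the config file is an assumption, since the text above only says that either one can override the environment default.

```python
import os


def resolve_wordlist_dir(cli_dir=None, config_dir=None, environ=None):
    """Pick the wordlist directory from the available sources."""
    env = os.environ if environ is None else environ
    if cli_dir:                       # -d dir on the command line
        return cli_dir
    if config_dir:                    # directory named in the config file
        return config_dir
    if env.get("BOGOFILTER_DIR"):     # environment default, first choice
        return env["BOGOFILTER_DIR"]
    if env.get("HOME"):               # environment default, fallback
        return os.path.join(env["HOME"], ".bogofilter")
    # Neither variable is set, so the directory must be given explicitly.
    raise ValueError("no wordlist directory: use -d dir")


print(resolve_wordlist_dir(environ={"HOME": "/home/user"}))
print(resolve_wordlist_dir(cli_dir="/tmp/test", environ={"HOME": "/home/user"}))
```

The final error case matches the ENVIRONMENT section of the man page, which requires -d when neither BOGOFILTER_DIR nor HOME is set.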

> >        The -k tag option sets the cache size for the BerkeleyDB
> >        subsystem. Properly sizing the cache improves bogofilter's
> >        performance. Run the bogotune script to determine the
> >        recommended size.
> 
> Enough if only in config file.

Many of the command line options also have config file options.  It's a
matter of style and preference as to which one (command line or config
file) is used.

> >        The  -L	tag option configures a tag which can be included
> >        in the information being logged by the -l option,  but  it
> >        requires	 a  custom format that includes the %l string for
> >        now. This option implies -l.
> 
> Enough if only in config file.

Likely so.

> 
> >        The -I filename option tells bogofilter to read its  input
> >        from the specified file, rather than from stdin
> 
> I cannot see a situation where we could not read from stdin.
> So this would be superfluous.

-I and -O are useful when tracing scripts and debugging the code.

> >        The -O filename option tells bogofilter where to write its
> >        output in passthrough mode. Note that this only works when
> >        -p is explicitly given.
> 
> Why not capture this from stdout? So this could also go.
> 
> >        The  -W	 option tells bogofilter to operate with a single
> >        wordlist, named wordlist.db. Each token in wordlist.db  is
> >        stored  as  an  ASCII string with two counts (for spam and
> >        ham) and (optionally) a timestamp.
> > 
> > 
> >        The -WW option tells bogofilter to operate with a pair of
> >        wordlists, named spamlist.db and goodlist.db. Spamlist.db
> >        stores tokens, counts, and timestamps for tokens from spam
> >        messages. Goodlist.db stores tokens, counts, and
> >        timestamps for tokens from ham messages.
> 
> I think those can go. Either we drop the two lists
> completely or you can set it in the config file.

> 
> >        The -O filename option tells bogofilter where to write its
> >        output in passthrough mode. Note that this only works when
> >        -p is explicitly given.
> 
> We had that before. Needs to be fixed in the man page.
> 
> >        The -g option selects the original Graham form of the
> >        calculation method.
> > 
> >        The -r option selects the Robinson  modifications  to  the
> >        calculation method.
> > 
> >        The -f option selects the Robinson-Fisher modifications to
> >        the calculation method.
> 
> Those can go, config file is enough.
> 
> >        Bogofilter has three special parsing options which can be
> >        enabled (or disabled) at the user's discretion. The
> >        options are of form -Px and -PX where x designates an
> >        option letter. For the parsing options, a lower case
> >        letter enables the option and an upper case letter disables
> >        it.
> 
> I think they can all go completely. Let's fix the defaults.

Agreed.

> >        The  -m	[value][,value][,value] option allows setting the
> >        min_dev value and, optionally, the robs and  robx  values.
> 
> >        The -o [value][,value] option allows setting the
> >        spam_cutoff value and, optionally, the ham_cutoff value.
> 
> Useful for testing, but it could be done using the -c
> switch. I'd leave them in.

Many of these command line options are used in the regression tests
("make check").  Having to generate config files for -W, -WW, -k, -m,
-o, etc would be a pain in the butt and would make the test scripts much
bigger and harder to 

> >        Option -y date specifies the date to give to  tokens  that
> >        don't have dates.
> 
> Is that relevant for bogofilter? Or should that be bogoutil?

It can be used to turn off timestamps, which reduces the database size.

> 
> > ENVIRONMENT
> >        Bogofilter will initialize  its	data  base  directory  to
> >        $BOGOFILTER_DIR	if  BOGOFILTER_DIR  is	set. If it is not
> >        set, bogofilter will  use  $HOME/.bogofilter  instead.  If
> >        neither	BOGOFILTER_DIR nor HOME is set, the -d dir option
> >        must be present.
> 
> With the combined wordlist, we only have one file in that
> directory. So it would be good enough to name the file directly.

Maybe.  Need to think about it ...



