support for multiple wordlists

David Relson relson at osagesoftware.com
Tue May 18 02:00:24 CEST 2004


On Mon, 17 May 2004 19:45:29 -0400
Tom Allison wrote:

> David Relson wrote:
> > On Mon, 17 May 2004 17:57:06 -0400
> > Tom Allison wrote:
> > 
> > 
> >>David Relson wrote:
> >>
> >>>Greetings,
> >>>
> >>commas and such make me dizzy.  The order numbers (5,6,7,8) might be
> >>better replaced with a single parameter (eg: wordlist_order).  Also,
> >>the terms you've described above may have redundancy.  Aren't
> >>'ignore' and 'R' redundant?  What happens when I have
> >>"wordlist=ignore, ~/ignorelist.db, 7, R" ???  (See your NOTE 3)
> > 
> > 
> > spaces instead of commas would be fine.
> > 
> > 
> >>suggestion:
> >>wordlist_user= ~/.bogofilter/wordlist.db
> >>wordlist_global=/var/lib/bogofilter/wordlist.db
> >>wordlist_ignore=~/.bogofilter/ignorelist.db
> >>wordlist_order= global ignore user (whitespace separated: " " or
> >>\n...)
> > 
> > 
> > In bogofilter's config file processing code, all lines are of the
> > form "key=value(s)" and there's a list of valid keys.  Having
> > "key_name=value", "key_name2=value", is a problem.
> 
> I wasn't aware of this.  I don't recall seeing any examples of this.
> 
> I was thinking if you did this approach of using a completely distinct
> key for each of the three types of wordlists you presented, then I 
> assumed it would be trivial to modify them in the command line with 
> --wordlist_user=...  similar to how you can modify min_dev et al.
> 
> In this sense, you would simply add four new parameters
> (wordlist_user, wordlist_ignore, wordlist_global, wordlist_order) to
> bogofilter.cf with them defaulting to today's structure of:
> wordlist_ignore=
> wordlist_global=
> wordlist_user=~/.bogofilter/wordlist.db
> wordlist_order=user ignore global  (this doesn't matter as there's
> only one!)
> 
> 
> Or am I missing the idea that you might have many wordlists, not just 
> the three you proposed?

I'm not setting limits on the number of wordlists used.  If someone
wishes to use many at once, I'm providing the opportunity.
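
As a rough illustration of why a single repeatable key scales to any
number of wordlists, here is a minimal Python sketch. It is not
bogofilter's actual C code. The names VALID_KEYS and parse_config, the
'#' comment handling, and the way the wordlist value is split into
fields are assumptions made for the example; only the "key=value(s)"
line form and the sample value come from this thread.

```python
# Hypothetical sketch: not bogofilter's real config parser.
VALID_KEYS = {"wordlist", "min_dev"}  # assumed subset of the valid keys


def parse_config(text):
    """Parse 'key=value(s)' lines; a repeated 'wordlist' key accumulates."""
    wordlists = []
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and (assumed) '#' comments
        key, sep, value = line.partition("=")
        key = key.strip()
        if not sep or key not in VALID_KEYS:
            raise ValueError(f"invalid config line: {raw!r}")
        if key == "wordlist":
            # Accept commas and/or whitespace between the values.
            wordlists.append(value.replace(",", " ").split())
        else:
            settings[key] = value.strip()
    return wordlists, settings


wordlists, settings = parse_config("wordlist=ignore, ~/ignorelist.db, 7, R\n")
print(wordlists)  # [['ignore', '~/ignorelist.db', '7', 'R']]
```

With this shape, each extra wordlist is one more "wordlist=" line,
whereas a key such as "wordlist_user" is rejected unless it is added to
the list of valid keys.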

> > Also, having a separate order precludes additive operations.
> > 
> > 
> >>This provides specific exclusion from the checks of 'ignore' and 'R'
> >>being required.  And the order of precedence appears in one line of 
> >>configuration file and not across 3 (or more if you have a lot of
> >>REM'ed lines for old stuff)
> >>
> >>How would you affect the separate wordlists for configurations
> >>(min_dev, threshold, robx... bogotune stuff)?  I think this only
> >>applies to global/user lists.
> > 
> > 
> > There's no effect.  The scoring parameters are applied separately
> > from finding tokens in the wordlist(s).
> >
> 
> So you would have one set of min_dev/robx/robs for both global and
> user wordlists?  I would think this could cost you a lot of
> effectiveness.
> 
> I'm thinking ahead and would see an application for this where
> everyone on a mail server would access a global wordlist that is
> administrator managed with something like PI's train on error or
> something very "lean" because it will have to accommodate a lot of
> personal variations.  The '-u' would not be used here.
> Subsequently, each user who was interested would have their own user 
> wordlist (wordlist_user is defined) and could use '-u' and have more 
> training effects on this one as well.

Yikes!  Having separate parameter sets for each wordlist would be a
management nightmare.

...[snip]...

> > Command line parsing uses library function getopt() and optional
> > parameters are a problem.  Given the number of platforms which run
> > bogofilter and the many variants of getopt(), using optional
> > parameters is a no-no.
> > 
> 
> I don't understand this.
> Do you mean 'bogofilter -Sn wordlist_user wordlist_global' is bad?
> Could you do: 'bogofilter -Sn wordlist_user -Sn wordlist_global'
> without a 'no-no'?  Or does the duplication of '-Sn' really send
> things over the edge?

The functions for parsing command line options work best when an option,
say '-Z' either never has an argument or always has an argument.  Having
'-Z' sometimes with an argument and sometimes without an argument leads
to portability problems.
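
To make the point concrete, here is a small sketch using Python's
standard getopt module, which follows the conventions of the C getopt()
function. It is only an illustration; the option letters '-Z' and '-v'
and the helper name parse are made up. Once '-Z' is declared as always
taking an argument, the parser consumes the next word unconditionally,
so "'-Z' with nothing after it" cannot be expressed in the portable
form of the option string.

```python
import getopt


def parse(argv, always_has_arg):
    """Parse argv with -Z declared either way; -v is a plain flag."""
    # "Z:" declares that -Z always takes an argument; "Z" that it never does.
    shortopts = "Z:v" if always_has_arg else "Zv"
    return getopt.getopt(argv, shortopts)


# -Z always takes an argument: the next word is consumed, even "-v".
print(parse(["-Z", "-v"], True))   # ([('-Z', '-v')], [])

# -Z never takes an argument: "-v" is parsed as its own option.
print(parse(["-Z", "-v"], False))  # ([('-Z', ''), ('-v', '')], [])
```

Some getopt() variants offer an optional-argument extension, but
relying on it is exactly the portability problem described above.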

Updating more than 1 wordlist at a time is a no-no.  If you need to
change two, use:

   bogofilter -Sn wordlist_user < message
   bogofilter -Sn wordlist_global < message

Registering the same message in multiple wordlists seems odd to me.  A
user shouldn't be updating the system list and the sysadmin shouldn't be
updating the user's list.


> I thought you could do this based on the manpage use of bogotune -n 
> implying that you could have multiple directories/files listed after
> the -n and similarly for the -s.  My assumption was that the code was
> common.

Bogotune only uses 1 wordlist.  It uses -n and -s to allow specification
of multiple message files for scoring/tuning.

...[snip]...

> I was hoping to find a more straightforward approach to representing
> the different filenames/locations.  I guess it depends on where you
> want to save the information about the wordlist.  You do it as part of
> the definition of the key, "wordlist"; I was doing it as the name of
> the key, "wordlist_user".

There may be a more straightforward approach.  Having lots of parameters
is complex any way you slice it.

David


