support for multiple wordlists
relson at osagesoftware.com
Mon May 17 19:08:07 EDT 2004
On Mon, 17 May 2004 17:57:06 -0400
Tom Allison wrote:
> David Relson wrote:
> > Greetings,
> > At one time, bogofilter had support for multiple wordlists. I'm
> > thinking of resurrecting the code. Here's how I think it should
> > operate:
> > Wordlists have a number of attributes, notably name, filename,
> > precedence, and type.
> > Name: a short identifying symbol used when printing (error)
> > messages. Examples are "global", "user", "ignore".
> > Filename: When opening the wordlist, if the name is fully
> > qualified (with a leading '/' or '~'), that name is used, else the
> > usual search order is used, i.e. $BOGOFILTER_DIR, $BOGODIR, $HOME.
> > Precedence: an integer like 1, 2, 3, ... Wordlists are searched in
> > ascending order for the token. If the search token is found, lists
> > with the same precedence number will be checked (and counts added
> > together). Lists with higher precedence numbers will not be checked.
> > Type: 'R' and 'I' (for "regular" and "ignore"). Current wordlists
> > are of type 'R'. Type 'I' means "don't score the token if found in
> > the ignore list".
> I had assumed that if you had both /etc/bogofilter/wordlist.db (or
> /var/lib/bogofilter/wordlist.db) and ~/.bogofilter/wordlist.db that
> they might be shared in some way (probably with global first, user
> second, just like procmail rules).
> I guess I was just thinking of going with lots of procmail glue to
> make this all happen.
It's not at all obvious to me whether "global then user" is preferable
to "user then global". Precedence varies with environment.
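
To make the precedence rule quoted above concrete, here is a minimal
Python sketch. It is not bogofilter's C code. The Wordlist class, the
lookup() function, and the in-memory dicts standing in for the database
files are all hypothetical names made up for illustration. Lists are
visited in ascending precedence, counts are summed across lists that
share the precedence where the token is found, and a hit on a type 'I'
list means the token is not scored.

```python
from dataclasses import dataclass, field
from itertools import groupby


@dataclass
class Wordlist:
    name: str            # short symbol, e.g. "global", "user", "ignore"
    precedence: int      # lower numbers are searched first
    kind: str = "R"      # 'R' (regular) or 'I' (ignore)
    # token -> (spam_count, good_count); a stand-in for the .db file
    counts: dict = field(default_factory=dict)


def lookup(token, wordlists):
    """Return (status, spam, good) for token.

    status is "found", "ignore", or "unknown".
    """
    ordered = sorted(wordlists, key=lambda w: w.precedence)
    for _, group in groupby(ordered, key=lambda w: w.precedence):
        hits = [w for w in group if token in w.counts]
        if not hits:
            continue  # nothing at this precedence, try the next one
        if any(w.kind == "I" for w in hits):
            return ("ignore", 0, 0)  # don't score the token
        spam = sum(w.counts[token][0] for w in hits)
        good = sum(w.counts[token][1] for w in hits)
        return ("found", spam, good)  # higher precedences are not checked
    return ("unknown", 0, 0)


# Assumed sample data: "global" and "user" share precedence 2.
lists = [
    Wordlist("ignore", 1, "I", {"the": (0, 0)}),
    Wordlist("global", 2, "R", {"meeting": (1, 5)}),
    Wordlist("user", 2, "R", {"meeting": (0, 7)}),
]
print(lookup("meeting", lists))
print(lookup("the", lists))
```

Giving "user" precedence 1 and "global" precedence 2 (or the reverse)
selects "user then global" or "global then user" without any change to
the code, which is the reason for making precedence a per-list number.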
> commas and such make me dizzy. The order numbers (5,6,7,8) might be
> better replaced with a single parameter (eg: wordlist_order). Also,
> the terms you've described above may have redundancy. Isn't the
> ignore and 'R' redundant? What happens when I have
> "wordlist=ignore, ~/ignorelist.db, 7, R" ??? (See your NOTE 3)
Spaces instead of commas would be fine.
> wordlist_user= ~/.bogofilter/wordlist.db
> wordlist_order= global ignore user (whitespace separated: " " or
In bogofilter's config file processing code, all lines are of the form
"key=value(s)" and there's a list of valid keys. Having
"key_name=value", "key_name2=value", is a problem.
Also, having a separate order precludes additive operations.
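
As a rough sketch of what "key=value(s)" with a fixed list of valid keys
implies, here is some Python (again, not the real config code). The
field order name, filename, precedence, type is taken from the
"wordlist=ignore, ~/ignorelist.db, 7, R" example quoted above. The
function name, the one-entry key set, and the '#' comment handling are
assumptions for illustration only.

```python
import re

# Hypothetical: the real list of valid keys is much longer.
VALID_KEYS = {"wordlist"}


def parse_config(text):
    """Parse 'wordlist=name, filename, precedence, type' lines.

    Fields may be separated by commas, whitespace, or both. The key may
    repeat; each occurrence adds one wordlist.
    """
    wordlists = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment (assumed syntax)
        key, sep, value = line.partition("=")
        key = key.strip()
        if not sep or key not in VALID_KEYS:
            raise ValueError(f"unknown or malformed line: {raw!r}")
        fields = [f for f in re.split(r"[,\s]+", value.strip()) if f]
        if len(fields) != 4:
            raise ValueError(f"expected 4 fields, got {len(fields)}: {raw!r}")
        name, filename, precedence, kind = fields
        if kind not in ("R", "I"):
            raise ValueError(f"type must be 'R' or 'I': {raw!r}")
        wordlists.append({
            "name": name,
            "filename": filename,
            "precedence": int(precedence),
            "type": kind,
        })
    return wordlists


sample = """\
wordlist=global, /var/lib/bogofilter/wordlist.db, 2, R
wordlist=user ~/.bogofilter/wordlist.db 2 R
"""
for entry in parse_config(sample):
    print(entry)
```

With a fixed key set, a line such as "wordlist_user=..." is rejected as
an unknown key, while repeating the single "wordlist" key lets two lists
carry the same precedence so their counts can be added.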
> This provides specific exclusion from the checks of 'ignore' and 'R'
> being required. And the order of precedence appears in one line of
> configuration file and not across 3 (or more if you have a lot of
> REM'ed lines for old stuff)
> How would you affect the separate wordlists for configurations
> (min_dev, threshold, robx... bogotune stuff)? I think this only
> applies to global/user lists.
There's no effect. The scoring parameters are applied separately from
finding tokens in the wordlist(s).
> > Note 1: bogofilter's registration flags ('-s', '-n', '-u', '-S',
> > '-N' ) will apply to the first list named.
> Similar to bogotune, could you default to wordlist_user for these
> params unless you specify otherwise? Not sure, but maybe:
> bogofilter -u ==> defaults to wordlist_user
> bogofilter -u wordlist_global ==> only wordlist_global
> bogofilter -u wordlist_global wordlist_user ==> does both: space
> separated list?
> A really complicated version would be something like:
> bogofilter -pe wordlist_global -u wordlist_user (assumes previous -pe?)
> bogofilter -n wordlist_user -Sn wordlist_global
> bogofilter -Sn wordlist_user wordlist_global (space separated list
> affects both)
Command line parsing uses the library function getopt(), and optional
parameters are a problem. Given the number of platforms which run
bogofilter and the many variants of getopt(), using optional parameters
is a no-no.
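
A small illustration of the ambiguity, using Python's standard getopt
module as a stand-in for the C library function (the two are similar in
spirit but not identical). The option string is a hypothetical subset of
the flags, with '-u' declared as always taking a wordlist name.

```python
import getopt

# Hypothetical option string: '-u' requires an argument, '-n' takes none.
SHORTOPTS = "u:n"


def parse(argv):
    """Return (options, remaining_args) for the given argument list."""
    return getopt.getopt(argv, SHORTOPTS)


# The wordlist name is given, so both flags are seen.
print(parse(["-u", "wordlist_global", "-n"]))

# The wordlist name is omitted: '-n' is taken as the argument of '-u'
# and is never seen as a flag of its own.
print(parse(["-u", "-n"]))
```

Once '-u' is declared as taking a name, "bogofilter -u" with nothing
after it is an error, and "-u -n" quietly treats "-n" as the name. A
flag that only sometimes takes a name needs optional-argument support,
which is exactly the part that differs between getopt() variants.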
> (the -peu example above might be pretty lame...)
> > Note 2: to build an ignore list, create a text file (for example,
> > ignorelist.txt) using any text editor, then use bogoutil to convert
> > it to database format, e.g. "bogoutil -l ignorelist.db <
> > ignorelist.txt".
> OK: echo "foo" | bogoutil -l ignorelist.db
> should work as well for individuals.
Right! Thanks for the reminder.
> > Note 3: having lists of types 'R' and 'I' of the same precedence
> > won't be allowed because the types are contradictory.
> See comments about wordlist_user and such above. I think you can
> relabel the parameters and exclude this problem from happening.
Relabeling prevents this problem and causes others...
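
For what it's worth, the Note 3 restriction is cheap to check when the
lists are loaded. A minimal Python sketch, assuming each wordlist has
already been reduced to a (name, precedence, type) tuple; the function
name is made up:

```python
from collections import defaultdict


def conflicting_precedences(wordlists):
    """Return the precedence numbers that mix types 'R' and 'I'.

    wordlists is an iterable of (name, precedence, type) tuples.
    """
    types_at = defaultdict(set)
    for _name, precedence, kind in wordlists:
        types_at[precedence].add(kind)
    return sorted(p for p, kinds in types_at.items() if len(kinds) > 1)


ok = [("ignore", 1, "I"), ("global", 2, "R"), ("user", 2, "R")]
bad = [("ignore", 7, "I"), ("user", 7, "R")]
print(conflicting_precedences(ok))
print(conflicting_precedences(bad))
```

An empty result means the configuration is acceptable; anything else
names the precedence levels that would have to be rejected.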