support for multiple wordlists

Tom Anderson tanderso at oac-design.com
Tue May 18 00:56:45 CEST 2004


On Mon, 2004-05-17 at 11:36, David Relson wrote:
> > How about this:
> > wordlist=whitelist, ~/ignorelist.db, 5, I
> > wordlist=global_zoology_whitelist, http://blah.org/bogofilter/ignorelists/zoology.db, 6, I
> > wordlist=primary, ~/wordlist.db, 7, R
> > wordlist=system, /var/spool/bogofilter/wordlist.db, 8, R
> > wordlist=zoology_spamtrap, http://blah.org/bogofilter/wordlists/spamtraps/zoology.db, 9, R
> > wordlist=general_spamtrap, http://blah.org/bogofilter/wordlists/spamtraps/general.db, 9, R
> > 
> > Obviously, people could maintain wordlists and ignorelists for
> > select groups of people with similar kinds of email.  The lists above
> > would likely belong to a zoologist.  Being able to retrieve wordlists
> > from remote sites (perhaps cached locally once a day, or else searched
> > one token at a time remotely via some server-side code) would, I
> > think, be a huge advantage.
> 
> Again, I don't see value in adding "remote database" code to bogofilter.
>  wget and rsync already exist and can do retrievals perfectly well.

David,

The value is precisely the same as for using a "system" list, except
that some people don't have a "system".  This way, individual users can
still have the benefit of failover to a less tailored, more general
wordlist when a token is not found in their own list.  Imagine the space
savings possible when more than 50% of tokens are sufficiently scored in
a remote list.  Thousands of people can use the same 20M list without
having to laboriously train their own.  Corrections can go in their own
individual lists for increased accuracy, but with significantly fewer
resources than maintaining the whole thing locally.
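
To make the failover idea concrete, here is a rough sketch in Python.
It is not bogofilter code: the function name lookup, the dict-based
lists, and the (spam_count, good_count) pairs are all made up for
illustration, and it assumes the lists are simply tried in order from
most personal to most general.

```python
# Hypothetical sketch, not bogofilter code. Each wordlist is modelled as
# a dict mapping token -> (spam_count, good_count).

def lookup(token, wordlists):
    """Return (list_name, counts) from the first list that knows the token.

    `wordlists` is a sequence of (name, mapping) pairs, ordered from the
    most personal list to the most general one.
    """
    for name, mapping in wordlists:
        counts = mapping.get(token)
        if counts is not None:
            return name, counts
    # No list has seen this token.
    return None, None


# A small personal list backed by a larger, shared one (made-up data).
personal = {"zebra": (0, 12)}
remote = {"zebra": (3, 3), "mortgage": (950, 4)}
lists = [("primary", personal), ("general_spamtrap", remote)]

print(lookup("zebra", lists))     # answered by the personal list
print(lookup("mortgage", lists))  # falls through to the shared list
print(lookup("unseen", lists))    # unknown everywhere
```

The personal list only needs to hold the tokens where the user
disagrees with the shared list, which is where the space savings would
come from.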

Moreover, as I described above, lists tailored for specific kinds of
people such as programmers, doctors, lawyers, housewives, sports
enthusiasts, etc., can get new users off to a great start very quickly.
Similarly, ignore lists for doctors might come prepackaged with terms
like "viagra", while ignore lists for chefs might include "breast", etc.

Plus, a remote list can be maintained somewhat more "professionally"
than general users might be able to achieve on their own.  And it can
be trained on spamtrap spams that have just been released into the wild,
before the users of the list have ever received them.

Clearly, these are strong reasons for including the capability of
looking up tokens in wordlists over the internet.  Wget and rsync are
not viable options for this.  You need to think like an end user: the
entire process needs to be as transparent as possible.  You should be
able to set a config option once at setup, and everything else should
happen automatically.
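
As an illustration of how little the user would have to write, here is
a sketch of parsing one of the proposed wordlist= lines in Python.
This is only my reading of the proposal, not an existing bogofilter
format: I am assuming the third field is a precedence number and the
fourth a type letter (I for ignore, R for regular), and the names
WordlistSpec and parse_wordlist are hypothetical.

```python
from typing import NamedTuple


class WordlistSpec(NamedTuple):
    name: str        # e.g. "primary"
    location: str    # local path or URL, kept exactly as written
    precedence: int  # assumed meaning of the third field
    kind: str        # assumed meaning of the fourth field: "I" or "R"
    remote: bool     # True when the location is an http(s) URL


def parse_wordlist(line):
    """Parse one 'wordlist=name, location, precedence, kind' line."""
    key, _, value = line.partition("=")
    if key.strip() != "wordlist":
        raise ValueError("not a wordlist line: %r" % line)
    fields = [field.strip() for field in value.split(",")]
    if len(fields) != 4:
        raise ValueError("expected 4 comma-separated fields: %r" % line)
    name, location, precedence, kind = fields
    remote = location.startswith(("http://", "https://"))
    return WordlistSpec(name, location, int(precedence), kind, remote)


spec = parse_wordlist(
    "wordlist=zoology_spamtrap, "
    "http://blah.org/bogofilter/wordlists/spamtraps/zoology.db, 9, R"
)
print(spec)
```

The remote flag is the only thing a fetch-and-cache step would need to
key off; whether the list lives on disk or on a server would stay
invisible to the user.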

Tom




