Ignore lists [was: Keeping the cruft out ...]

Fri Mar 5 04:33:58 CET 2004

On Thu, 04 Mar 2004 19:07:28 -0800
Greg McCann wrote:

> On 3/4/2004 at 8:39 PM Tom Allison <tallison at tacocat.net> wrote:
> 
> >If there is a list of words you wish to ignore couldn't you do this?
> >
> >put your list of ignored words into a file: ~/.bogofilter/ignore
> >periodically run the following:
> >bogoutil -d wordlist.db | fgrep -v -f ignore > new_wordlist
> >mv wordlist.db wordlist.db.bak
> >bogoutil -l wordlist.db < new_wordlist
> >
> >(or something like that)
> 
> This is tested...
> 
> bogoutil -d wordlist.db | fgrep -v -f ignore.txt | bogoutil -l wordlist.new.db
> mv wordlist.new.db wordlist.db
> 
> It is somewhat effective, but I see a couple of limitations.

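With a properly formatted list, "egrep -v -f ignore.list" can be used
for deleting tokens.  Unlike fgrep, which treats every pattern as a
fixed string, egrep interprets each line of the list as a regular
expression.  Use a caret "^" at the beginning of each line and a space
" " at the end of each line, e.g. "^sex " for the token "sex".  That
way only complete tokens will be matched and deleted.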
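
As a rough sketch of that idea (not tested against a real wordlist),
the anchored list could be generated from a plain one-token-per-line
file.  The function names make_patterns and drop_ignored and the file
names are made up for illustration, and the sketch assumes each line of
the "bogoutil -d" dump starts with the token followed by a space.
"grep -E" is used as the equivalent of egrep.

```shell
#!/bin/sh
# make_patterns: read plain tokens (one per line) on stdin and write
# anchored extended-regex patterns on stdout: "^" + escaped token + " ".
# Regex metacharacters in a token are backslash-escaped first so that a
# token such as "a.b" only matches itself.
make_patterns() {
    sed -e 's/[][\.*^$+?(){}|]/\\&/g' -e 's/^/^/' -e 's/$/ /'
}

# drop_ignored PATTERN_FILE: filter a dump on stdin, removing every line
# whose leading token matches one of the anchored patterns.
drop_ignored() {
    grep -E -v -f "$1"
}

# Assumed usage (file names are placeholders):
#   make_patterns < ignore.txt > ignore.list
#   bogoutil -d wordlist.db | drop_ignored ignore.list \
#       | bogoutil -l wordlist.new.db
```

With "sex" in ignore.txt this drops the "sex" line from the dump but
keeps "Middlesex" and "sextant".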

Certainly this is simpler than dealing with a second, special wordlist. 
Whether it will work better, I can't say.

> 1.  It does not do whole-word matching.  For example, putting "sex" in
> your ignore list will eliminate "Middlesex", "sextant", etc. from your
> wordlist.  Is there a way to get regular expressions to work with this
> - something like "^sex$" or maybe "^sex .*", so it would only match
> the whole word?  I couldn't get it to work.  When I tried to put
> regexp characters into my ignore list, it took them literally.
> 
> 2.  It may cause problems if your wordlist is updated automatically. 
> Presumably you want these words ignored because they reduce
> bogofilter's scoring accuracy.  But if you eliminate all occurrences
> of a token from the wordlist, then the effect of that token being
> added again before you can refilter the wordlist will be magnified, I
> think.  For example, I would put "rcvd:Mar" in my ignore list because
> my wordlist is auto-updated with new spam, which always contains the
> abbreviation of the current month in the header.  Therefore "rcvd:Mar"
> becomes very spammy and I want to ignore it.  However if I just filter
> it from my wordlist, the first new spam that comes in will make this
> token 100% spammy again.

A single token doesn't have much effect, so I wouldn't worry about
"rcvd:Mar" reappearing as 100% spammy.  Remember, it's the totality of
all the tokens in the message that determines the score.
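
To put rough numbers on that, here is a small sketch using Robinson's
geometric-mean combining formula.  This is an illustration only: it is
not bogofilter's actual scoring code, the function name score is made
up, and the per-token probabilities are invented.

```shell
#!/bin/sh
# score PROB...: combine per-token spam probabilities (each strictly
# between 0 and 1) with Robinson's geometric-mean formula and print a
# value between 0 (ham) and 1 (spam), rounded to two decimals.
score() {
    printf '%s\n' "$@" | awk '
        { n++; lp += log($1); lq += log(1 - $1) }
        END {
            P = 1 - exp(lq / n)    # 1 - geometric mean of (1 - p)
            Q = 1 - exp(lp / n)    # 1 - geometric mean of p
            printf "%.2f\n", (1 + (P - Q) / (P + Q)) / 2
        }'
}

# Twenty mildly hammy tokens (invented value 0.3 each).
ham=$(awk 'BEGIN { for (i = 0; i < 20; i++) print 0.3 }')

score $ham          # message without the spammy token
score $ham 0.99     # same message plus one very spammy token
```

With these made-up values the first call prints 0.30 and the second
0.39, so one very spammy token moves the result but leaves it well on
the ham side.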