Ignore lists [was: Keeping the cruft out ...]

Greg McCann greg at cambria.com
Fri Mar 5 04:07:28 CET 2004


On 3/4/2004 at 8:39 PM Tom Allison <tallison at tacocat.net> wrote:

>If there is a list of words you wish to ignore couldn't you do this?
>
>put your list of ignored words into a file: ~/.bogofilter/ignore
>periodically run the following:
>bogoutil -d wordlist.db | fgrep -v -f ignore > new_wordlist
>mv wordlist.db wordlist.db.bak
>bogoutil -l wordlist.db < new_wordlist
>
>(or something like that)

This is tested...

bogoutil -d wordlist.db | fgrep -v -f ignore.txt | bogoutil -l wordlist.new.db
mv wordlist.new.db wordlist.db

It is somewhat effective, but I see a couple of limitations.

1.  It does not do whole-word matching.  For example, putting "sex" in your ignore list will eliminate "Middlesex", "sextant", etc. from your wordlist.  Is there a way to get regular expressions to work with this, something like "^sex$" or maybe "^sex .*", so that it would match only the whole word?  I couldn't get it to work.  When I put regexp characters into my ignore list, fgrep took them literally.

2.  It may cause problems if your wordlist is updated automatically.  Presumably you want these words ignored because they reduce bogofilter's scoring accuracy.  But if you remove every occurrence of a token from the wordlist, then I think the effect of that token being added again, before you get a chance to refilter the wordlist, will be magnified.  For example, I would put "rcvd:Mar" in my ignore list because my wordlist is auto-updated with new spam, which always contains the abbreviation of the current month in the header.  Therefore "rcvd:Mar" becomes very spammy and I want to ignore it.  However, if I just filter it from my wordlist, the first new spam that comes in puts the token back with only spam counts behind it, so it looks 100% spammy again.
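
On the whole-word problem in point 1, one possible workaround is to compare the token field exactly instead of searching each line for a substring.  The sketch below has not been run against a real wordlist.  It assumes that "bogoutil -d" prints one token per line, with the token as the first whitespace-separated field, and that the ignore file holds one token per line.  The function name filter_tokens is made up for illustration.

```shell
# filter_tokens IGNORE_FILE
# Reads a wordlist dump on stdin and prints only the lines whose first
# field (assumed to be the token) is NOT listed in IGNORE_FILE.
# Matching is exact, so "sex" does not remove "Middlesex" or "sextant".
filter_tokens() {
    awk -v ignore_file="$1" '
        BEGIN {
            # Load the ignore list, one token per line, skipping blank lines.
            while ((getline line < ignore_file) > 0)
                if (line != "")
                    ignore[line] = 1
            close(ignore_file)
        }
        # Pass through every dump line whose token is not in the list.
        !($1 in ignore)
    '
}

# Intended use, in place of the fgrep step:
#   bogoutil -d wordlist.db | filter_tokens ignore.txt | bogoutil -l wordlist.new.db
#   mv wordlist.new.db wordlist.db
```

This only addresses point 1.  Point 2 still applies, since a filtered token can be added back by the next auto-update.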


Greg





