Ignore lists [was: Keeping the cruft out ...]
David Relson
relson at osagesoftware.com
Fri Mar 5 04:33:58 CET 2004
On Thu, 04 Mar 2004 19:07:28 -0800
Greg McCann wrote:
> On 3/4/2004 at 8:39 PM Tom Allison <tallison at tacocat.net> wrote:
>
> >If there is a list of words you wish to ignore couldn't you do this?
> >
> >put your list of ignored words into a file: ~/.bogofilter/ignore
> >periodically run the following:
> >bogoutil -d wordlist.db | fgrep -v -f ignore > new_wordlist
> >mv wordlist.db wordlist.db.bak
> >bogoutil -l wordlist.db < new_wordlist
> >
> >(or something like that)
>
> This is tested...
>
> bogoutil -d wordlist.db | fgrep -v -f ignore.txt | bogoutil -l wordlist.new.db
> mv wordlist.new.db wordlist.db
>
> It is somewhat effective, but I see a couple of limitations.
With a properly formatted list, "egrep -v -f ignore.list" can be used
for deleting tokens. Use a caret "^" at the beginning of each line and
a space " " at the end of each line. That way only complete tokens will
be matched and deleted.
Certainly this is simpler than dealing with a second, special wordlist.
Whether it will work better, I can't say.
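
A rough sketch of how such a list could be built from a plain list of
words, in case it helps. It assumes each line of the "bogoutil -d" dump
starts with the token followed by a space; the file names and the sample
dump lines are made up for illustration, and the sed escaping step is my
own addition so that tokens containing regexp characters are still
matched literally.

```shell
#!/bin/sh
# Sketch only: file names and sample data below are hypothetical.
workdir=$(mktemp -d)

# Plain ignore list, one token per line.
printf '%s\n' 'sex' 'rcvd:Mar' > "$workdir/ignore.txt"

# Escape regexp metacharacters, then put "^" in front of each line
# and a space " " at the end, so only complete tokens match.
sed -e 's/[][\.*^$+?(){}|]/\\&/g' -e 's/^/^/' -e 's/$/ /' \
    "$workdir/ignore.txt" > "$workdir/ignore.list"

# Sample lines standing in for the output of "bogoutil -d wordlist.db".
# Assumed layout: token, then counts and a date, separated by spaces.
printf '%s\n' \
    'Middlesex 1 4 20040301' \
    'sex 9 0 20040302' \
    'sextant 0 2 20040303' \
    'rcvd:Mar 7 0 20040304' > "$workdir/dump.txt"

# Drop every line whose leading token is in the ignore list.
filtered=$(grep -E -v -f "$workdir/ignore.list" "$workdir/dump.txt")
printf '%s\n' "$filtered"

rm -rf "$workdir"
```

With the sample data only the "Middlesex" and "sextant" lines are left.
For real use, the dump would come from "bogoutil -d wordlist.db" and the
filtered output would be piped into "bogoutil -l" as in the quoted
command.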
> 1. It does not do whole-word matching. For example, putting "sex" in
> your ignore list will eliminate "Middlesex", "sextant", etc. from your
> wordlist. Is there a way to get regular expressions to work with this
> - something like "^sex$" or maybe "^sex .*", so it would only match
> the whole word? I couldn't get it to work. When I tried to put
> regexp characters into my ignore list, it took them literally.
>
> 2. It may cause problems if your wordlist is updated automatically.
> Presumably you want these words ignored because they reduce
> bogofilter's scoring accuracy. But if you eliminate all occurrences
> of a token from the wordlist, then the effect of that token being
> added again before you can refilter the wordlist will be magnified, I
> think. For example, I would put "rcvd:Mar" in my ignore list because
> my wordlist is auto-updated with new spam, which always contains the
> abbreviation of the current month in the header. Therefore "rcvd:Mar"
> becomes very spammy and I want to ignore it. However if I just filter
> it from my wordlist, the first new spam that comes in will make this
> token 100% spammy again.
Single tokens don't have a lot of effect, hence I wouldn't worry about
"rcvd:Mar" reappearing as 100% spammy. Remember, it's the totality of
all the tokens in the message that determines the score.