Ignore lists

Matthias Andree matthias.andree at gmx.de
Wed Mar 3 19:17:54 CET 2004


David Relson <relson at osagesoftware.com> writes:

> On Wed, 3 Mar 2004 10:55:44 -0500
> Bob George wrote:
>
>> I very much like the focus of bogofilter -- do one thing fast and
>> well. But I also want to make sure I'm doing what I can outside of it
>> to keep it working optimally. Spamassassin has the bayes_ignore_header
>> feature, but I've missed anything similar for bogofilter (not a
>> criticism!), so I've written a small filter to strip out locally-added
>> headers before training. I don't want to use -H because there are
>> plenty of useful tags in the headers that I DO want scored.
>
> Bogofilter once had a feature called ignore lists.  Their purpose was to
> allow a fast lookup of common words (like "the", "and", etc) and save
> time by avoiding searching the full wordlist.  It was eventually
> realized that, since the ignore list was likely pretty small, most all
> words would require _two_ searches when an ignore list was used.  On
> this basis, the feature was labeled "not useful" and removed.

...

> The hammy tokens are mostly header tokens like "from:gnu.org", the
> ipaddress, etc, etc.  Having an ignore list with 50 or 100 of these
> tokens would allow bogofilter to recognize the spam.
>
> An alternate approach would be to run
>
> 	bogoutil -d ... | egrep -v "(unwanted|words)" | bogoutil -l ...
>
> or something similar.

As I understand it, Bob suggests ignoring headers with a particular name
altogether, pretending, for instance, that all X-Spam-Status: headers
had not been there.

This scheme is employed by many spam taggers; spamprobe, for instance,
used to have a hardcoded set of headers it looked at and ignored every
other header.
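
To illustrate ignoring headers by name, here is a minimal sketch of a
pre-training filter in Python. It is not bogofilter code and not Bob's
actual filter; the function name and the header list are hypothetical,
and it relies only on the standard library email module.

```python
from email import message_from_string

# Hypothetical list of locally added headers to hide from training.
IGNORED_HEADERS = ("X-Spam-Status", "X-Spam-Level")


def strip_headers(raw_message, ignored=IGNORED_HEADERS):
    """Return the message text with every header named in `ignored` removed."""
    msg = message_from_string(raw_message)
    for name in ignored:
        # Removes all occurrences of the header; names match
        # case-insensitively and a missing header is silently skipped.
        del msg[name]
    return msg.as_string()
```

The output of such a filter would then be piped into the training run,
so all remaining headers are still tokenized and scored.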

This feature is a bit different from the "ignore list": the latter
would apply to tokens, the former to headers. For headers with a stable
ordering, such as Received:, a more sophisticated feature, say
"ignore first 7 Received: headers" or "ignore last 7 Received: headers",
could be helpful.
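
A positional rule like "ignore first N Received: headers" could be
sketched as below. This is only an illustration under stated
assumptions: the function name and parameters are made up, the message
is assumed to use LF line endings, and "first" is taken to mean topmost
in the message (the headers added by the most recent hops).

```python
def drop_received(raw_message, count, from_top=True):
    """Drop `count` Received: headers from the top or bottom of the header block."""
    head, sep, body = raw_message.partition("\n\n")

    # Group physical lines into header fields; a line starting with
    # a space or tab continues the previous field.
    fields = []
    for line in head.split("\n"):
        if line[:1] in (" ", "\t") and fields:
            fields[-1] += "\n" + line
        else:
            fields.append(line)

    # Indices of all Received: fields, in message order.
    positions = [i for i, f in enumerate(fields)
                 if f.lower().startswith("received:")]
    if from_top:
        doomed = set(positions[:count])
    else:
        doomed = set(positions[max(len(positions) - count, 0):])

    kept = [f for i, f in enumerate(fields) if i not in doomed]
    return "\n".join(kept) + sep + body
```

Calling drop_received(text, 7) would hide the seven topmost Received:
headers, and drop_received(text, 7, from_top=False) the seven at the
bottom; all other headers keep their original order.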

We could provide these features and let users experiment to see whether
they are helpful or not.

-- 
Matthias Andree

Encrypt your mail: my GnuPG key ID is 0x052E7D95



