matthias.andree at gmx.de
Wed Mar 3 13:17:54 EST 2004
David Relson <relson at osagesoftware.com> writes:
> On Wed, 3 Mar 2004 10:55:44 -0500
> Bob George wrote:
>> I very much like the focus of bogofilter -- do one thing fast and
>> well. But I also want to make sure I'm doing what I can outside of it
>> to keep it working optimally. Spamassassin has the bayes_ignore_header
>> feature, but I've missed anything similar for bogofilter (not a
>> criticism!), so I've written a small filter to strip out locally-added
>> headers before training. I don't want to use -H because there are
>> plenty of useful tags in the headers that I DO want scored.
> Bogofilter once had a feature called ignore lists. Their purpose was to
> allow a fast lookup of common words (like "the", "and", etc) and save
> time by avoiding searching the full wordlist. It was eventually
> realized that, since the ignore list was likely pretty small, most all
> words would require _two_ searches when an ignore list was used. On
> this basis, the feature was labeled "not useful" and removed.
> The hammy tokens are mostly header tokens like "from:gnu.org", the
> ipaddress, etc, etc. Having an ignore list with 50 or 100 of these
> tokens would allow bogofilter to recognize the spam.
> An alternate approach would be to run
> bogoutil -d ... | egrep -v "(unwanted|words)" | bogoutil -l ...
> or something similar.
As I understand it, Bob suggests ignoring headers with a particular name
altogether, pretending, for instance, that all X-Spam-Status: headers
had not been there.
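To make that concrete, here is a rough sketch of such a pre-filter in
Python. It is only an illustration of the idea, not anything bogofilter
ships: the function name strip_headers and the choice of header names
are made up for this example, and it assumes a single message with
"\n" line endings whose headers end at the first blank line.

```python
def strip_headers(raw, ignored):
    """Return the message text with every header named in `ignored` removed.

    Hypothetical helper, not part of bogofilter. Header names are
    compared case-insensitively; folded continuation lines (those that
    start with a space or tab) are dropped along with their header.
    """
    ignored = {name.lower() for name in ignored}
    # Headers end at the first blank line; the body is left untouched.
    head, sep, body = raw.partition("\n\n")
    kept = []
    skipping = False
    for line in head.split("\n"):
        if line[:1] in (" ", "\t"):
            # Continuation line: shares the fate of the header above it.
            if not skipping:
                kept.append(line)
            continue
        name = line.split(":", 1)[0].strip().lower()
        skipping = name in ignored
        if not skipping:
            kept.append(line)
    return "\n".join(kept) + sep + body
```

The output of something like strip_headers(text, ["X-Spam-Status"])
could then be handed to bogofilter for training, so that locally-added
tags never reach the wordlist while all other header tokens still do.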
This scheme is employed by many spam taggers; for instance, spamprobe
used to have a hardcoded set of headers it looked at, and ignored every
other header.
This feature is a bit different from the "ignore list" - the latter
would apply to tokens, the former would apply to headers. For headers
with a stable ordering, such as Received:, a more sophisticated option,
say "ignore first 7 Received: headers" or "ignore last 7 Received:
headers", could be helpful.
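A positional rule of that kind might look roughly like the Python sketch
below. Again this is only an assumption about how one could do it
outside bogofilter: drop_received and its parameters are invented for
the example, and it assumes one message with "\n" line endings. Since
each relay prepends its own Received: line, "first" here means the
most recently added ones.

```python
def drop_received(raw, count, from_top=True):
    """Remove `count` Received: headers from the top or bottom of the chain.

    Hypothetical helper, not part of bogofilter. All other headers and
    the body are passed through unchanged.
    """
    if count <= 0:
        return raw
    head, sep, body = raw.partition("\n\n")
    # Group lines into fields: a header line plus its continuation lines.
    fields = []
    for line in head.split("\n"):
        if line[:1] in (" ", "\t") and fields:
            fields[-1].append(line)
        else:
            fields.append([line])
    # Positions of the Received: fields, in the order they appear.
    received = [i for i, field in enumerate(fields)
                if field[0].lower().startswith("received:")]
    doomed = set(received[:count] if from_top else received[-count:])
    kept = [line for i, field in enumerate(fields)
            if i not in doomed for line in field]
    return "\n".join(kept) + sep + body
```

With a helper like this, users could try drop_received(text, 7) against
drop_received(text, 7, from_top=False) on their own mail and compare the
results.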
We could provide these headers and let users experiment to see whether
they are helpful or not.
Encrypt your mail: my GnuPG key ID is 0x052E7D95