Ignore lists [was: Keeping the cruft out ...]

David Relson relson at osagesoftware.com
Wed Mar 3 17:49:30 CET 2004


On Wed, 3 Mar 2004 10:55:44 -0500
Bob George wrote:

> "Eric Wood" <eric at interplas.com> wrote:
> 
> > Bob George wrote:
> > > I'm going to some lengths to avoid cruft in bayes as well:
> >
> > I know there are mailing list purists here that would like to move
> > this discussion to the procmail mailing list (which I'm also a member
> > of), but I believe there should be a little more maildrop/procmail
> > help on the bogofilter website to get people started in the right
> > direction.
> 
> I very much like the focus of bogofilter -- do one thing fast and
> well. But I also want to make sure I'm doing what I can outside of it
> to keep it working optimally. Spamassassin has the bayes_ignore_header
> feature, but I've missed anything similar for bogofilter (not a
> criticism!), so I've written a small filter to strip out locally-added
> headers before training. I don't want to use -H because there are
> plenty of useful tags in the headers that I DO want scored.
> 
> Perhaps I'm going about this the wrong way. Is there a way to flag a
> phrase/header so it's NOT used in scoring one way or the other? To
> "drop" items learned in error? I could then just feed a file with
> "things to ignore" once. I've re-read the manpage, but don't see
> anything obvious.
> 
> - Bob

Hi Bob,

Bogofilter once had a feature called ignore lists.  Their purpose was to
allow a fast lookup of common words (like "the", "and", etc.) and save
time by avoiding a search of the full wordlist.  It was eventually
realized that, since the ignore list was likely to be pretty small,
almost all words would require _two_ searches when an ignore list was
used.  On this basis, the feature was labeled "not useful" and removed.
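
The lookup arithmetic can be illustrated with a small sketch.  This is
Python written only for illustration, not bogofilter's actual code; the
function name count_lookups and the sample tokens are made up, and it
assumes each token costs one search per list consulted.

```python
def count_lookups(tokens, ignore, wordlist):
    """Return (searches with an ignore list, searches without one)."""
    with_ignore = 0
    for tok in tokens:
        with_ignore += 1          # every token is first looked up in the ignore list
        if tok not in ignore:
            with_ignore += 1      # a miss still needs a search of the full wordlist
            wordlist.get(tok)
    without_ignore = len(tokens)  # one wordlist search per token
    return with_ignore, without_ignore


# Hypothetical message tokens; only two of the five are on the ignore list.
tokens = ["the", "viagra", "from:gnu.org", "and", "bogofilter"]
ignore = {"the", "and"}
wordlist = {"viagra": (95, 0), "from:gnu.org": (0, 120), "bogofilter": (1, 40)}

print(count_lookups(tokens, ignore, wordlist))
```

With a small ignore list most tokens miss it, so the total number of
searches goes up rather than down.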

More recently, I've been thinking of resurrecting it and using it at my
site.  Here's why:

gnu.org lists are open to all without validation, hence they _do_ get
spammed.  From these lists, I see a lot of "Unsures" with histograms
like:

X-Bogosity: Unsure, tests=bogofilter, spamicity=0.479794, version=0.15.9

#  int  cnt   prob  spamicity histogram
# 0.00   47 0.012718 0.009778 ###############################################
# 0.10    0 0.000000 0.009778 
# 0.20    0 0.000000 0.009778 
# 0.30    0 0.000000 0.009778 
# 0.40    0 0.000000 0.009778 
# 0.50    0 0.000000 0.009778 
# 0.60    0 0.000000 0.009778 
# 0.70    0 0.000000 0.009778 
# 0.80    0 0.000000 0.009778 
# 0.90   15 0.979402 0.417756 ###############

The hammy tokens are mostly header tokens like "from:gnu.org", the
IP address, etc.  Having an ignore list with 50 or 100 of these
tokens would allow bogofilter to recognize the spam.

An alternate approach would be to run

	bogoutil -d ... | egrep -v "(unwanted|words)" | bogoutil -l ...

or something similar.
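
A variant of the same idea, as a minimal Python sketch: it drops dump
lines whose token exactly matches an entry in a file of tokens to ignore.
It assumes that each line of "bogoutil -d" output starts with the token
followed by whitespace-separated counts; the script name drop_tokens.py
and the file name ignore.txt are hypothetical.

```python
import sys


def load_ignored(path):
    """Read one token per line from `path`; blank lines are skipped."""
    with open(path, "rb") as fh:
        return {ln.strip() for ln in fh if ln.strip()}


def filter_dump(lines, ignored):
    """Yield dump lines whose first field (the token) is not in `ignored`."""
    for line in lines:
        fields = line.split()
        if fields and fields[0] in ignored:
            continue              # exact match on the token: drop this entry
        yield line


# Work on bytes so tokens that are not valid UTF-8 pass through untouched.
if __name__ == "__main__" and len(sys.argv) > 1:
    kept = filter_dump(sys.stdin.buffer, load_ignored(sys.argv[1]))
    sys.stdout.buffer.writelines(kept)
```

It would slot into the pipeline in place of egrep:

	bogoutil -d ... | python drop_tokens.py ignore.txt | bogoutil -l ...

Unlike the egrep pattern, this compares whole tokens, so an entry such as
"words" does not also remove tokens that merely contain it.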




