help using the new ignore wordlist feature

David Relson relson at osagesoftware.com
Mon Jun 14 16:52:04 CEST 2004


On Mon, 14 Jun 2004 09:00:57 -0400
Eric Wood wrote:

> I've sort of been confused by all the discussion, but I gather that
> the purpose of the ignore wordlist is to make spam detection slightly
> more accurate.
> 
> 1. From the FAQ's "Can I tell bogofilter to ignore certain tokens?" I
> can see how to make my own ignorelist.  Now, is the ignore wordlist
> mainly supposed to contain only insignificant words (tokens), such as
> "is", "of", "or", "a", "the", etc.?  Does someone have such a list
> already compiled for English?
> 
> 2. Where's "Note 2" in the FAQ under "Can I use multiple wordlists?"
> 
> Thanks,
> -eric wood

Hi Eric,

Common words like "a" and "the" usually score around 0.5, so the min_dev
setting already excludes them from scoring and they don't need to be on
an ignore list.  Here's an example of where I've found the ignore list
to be useful:

The gnu.org mailing lists are open to all posters, with no subscription
needed.  This leaves them open to spammers.  Since the great majority of
messages from those lists are ham, my wordlists have a lot of strongly
hammish tokens from those lists, such as:

    List-Help
    List-Id
    List-Post
    List-Subscribe
    List-Unsubscribe
    head:List-Archive
    head:List-Help
    head:List-Id
    head:List-Post
    head:List-Subscribe
    head:List-Unsubscribe
    head:Precedence
    head:Sender
    head:X-Mailman-Version
    head:list
    head:listinfo
    head:lists
    head:mailarchive
    head:mailman
    head:mailto
    head:pipermail
    head:tracker
    head:unsubscribe
    listinfo
    lists
    mime:MIME-Version
    rcvd:esmtp
    rcvd:helo
    rcvd:invoked

Having all these tokens in wordlist.db causes every message from those
lists to have lots of hammish tokens, i.e. they bias the score towards
ham.  When spam comes in through the list, it usually scores (for me) as
Unsure because it has a bunch of ham tokens and a bunch of spam tokens.

Putting the above tokens into ignore.db results in bogofilter's score
being based more on the body of the message than on its header and gives
more accurate results -- and _that's_ how I use ignorelists!
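
To make the effect concrete, here is a rough Python sketch.  It is only
an illustration, not bogofilter's actual calculation: it uses a
simplified Robinson-style geometric-mean combination as a stand-in, and
the token probabilities, the min_dev value of 0.1, and the function name
combined_score are all made up for the example.

```python
from math import prod  # Python 3.8+


def combined_score(token_probs, min_dev=0.1, ignore=frozenset()):
    """Simplified stand-in score: 0.0 is hammish, 1.0 is spammish.

    token_probs maps token -> spam probability, strictly between 0 and 1.
    Tokens on the ignore list, or within min_dev of 0.5, are skipped.
    """
    used = [p for tok, p in token_probs.items()
            if tok not in ignore and abs(p - 0.5) >= min_dev]
    if not used:
        return 0.5  # nothing significant left: neutral score
    n = len(used)
    spamminess = 1.0 - prod(1.0 - p for p in used) ** (1.0 / n)
    hamminess = 1.0 - prod(used) ** (1.0 / n)
    return (1.0 + (spamminess - hamminess) / (spamminess + hamminess)) / 2.0


# Hypothetical probabilities for a spam that arrived through a list.
message = {
    "the": 0.51, "a": 0.49,            # near 0.5, dropped by min_dev
    "head:List-Id": 0.02,              # strongly hammish list headers
    "head:X-Mailman-Version": 0.03,
    "head:Precedence": 0.05,
    "cheap": 0.95, "pills": 0.98, "offer": 0.90,  # spammy body tokens
}
ignore_list = {"head:List-Id", "head:X-Mailman-Version", "head:Precedence"}

before = combined_score(message)                     # headers counted
after = combined_score(message, ignore=ignore_list)  # headers ignored

print(f"without ignore list: {before:.2f}")
print(f"with ignore list:    {after:.2f}")
```

With these invented numbers the first score lands near the middle
(Unsure territory) and the second lands well toward spam, which is the
same pattern I see with the real lists.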

To answer your question, the initial write-up (my email of May 17, with
Notes 1, 2, and 3) became the FAQ write-up.  Note 2 was about ignore
lists, which became a separate question.  Evidently I forgot to renumber
the remaining notes (which has now been attended to).

HTH,

David


