preliminary ignore results

David Relson relson at osagesoftware.com
Mon May 17 23:50:37 CEST 2004


Greetings,

A couple of the lists I subscribe to (or used to) get spammed regularly.
I've saved a number of the messages that bogofilter had trouble with.
The "trouble" was caused by the messages' list headers (seen as hammish)
and bodies (seen as spammish).  The headers and bodies tend to
balance one another, so the messages tend to score right around 0.5.

I've got a total of 280 such messages.  From them I found all header
tokens that occurred 100 times or more.  There were 71 of them.  I used
them to create the ignorelist.  The commands for doing this were
(roughly):

  bogofilter -d test -s test.d/msg.*
  bogoutil -d test/wordlist.db > test/wordlist.txt
  egrep -w '[12][0-9][0-9]' test/wordlist.txt | grep ":" | \
      egrep -v "^(rcvd:|head:)[0-9]" | \
      egrep -v "(mime:|subj:|relson|osagesoftware.com|rcvd:.*\-)" | \
      tee test/ignorelist.txt | wc -l
  bogoutil -l test/ignorelist.db < test/ignorelist.txt
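
For anyone repeating this, the selection step can also be written so that
the count threshold is explicit.  The following is only a sketch, not the
exact commands used above.  It assumes each line of the bogoutil dump has
the form "token spamcount goodcount date", and that the messages were
registered as spam so the count of interest is in the second field.  The
function name select_header_tokens is made up for illustration.

```shell
# Sketch: keep header tokens whose spam count is at least MIN (default 100).
# Assumption: each input line is "token spamcount goodcount date".
select_header_tokens() {
    min="${1:-100}"
    awk -v min="$min" '
        $1 !~ /:/                  { next }  # untagged body tokens
        $1 ~ /^(rcvd:|head:)[0-9]/ { next }  # tagged tokens starting with a digit
        $1 ~ /(mime:|subj:|relson|osagesoftware\.com|rcvd:.*-)/ { next }
        $2 + 0 >= min              { print } # explicit numeric threshold
    '
}

# Possible usage (paths as in the commands above):
#   select_header_tokens 100 < test/wordlist.txt > test/ignorelist.txt
```

Unlike the egrep -w pattern, which matches any three-digit field from 100
to 299, this compares the second field numerically, so it has no upper
limit.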

The essential lines of the config files are:

  current.cf
    wordlist=r,wordlist,./test/wordlist.db,1
    wordlist=i,ignore,./test/ignorelist.db,2

  ignore.cf
    wordlist=i,ignore,./test/ignorelist.db,1
    wordlist=r,wordlist,./test/wordlist.db,2
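
One way to compare the two configurations is to classify every saved
message under each config file and tally the verdicts.  The sketch below
only does the tallying.  It assumes one line of output per message that
begins with a one-letter verdict (S, H, or U), as bogofilter's terse
output is expected to do.  Check the -c and -t options against your
version before relying on the usage shown in the comment.  The name
tally_verdicts is made up for illustration.

```shell
# Sketch: count one-letter verdicts read from stdin, one message per line.
# Assumption: each line starts with S (spam), H (ham), or U (unsure).
tally_verdicts() {
    awk '
        { count[substr($0, 1, 1)]++ }
        END {
            printf "ham=%d spam=%d unsure=%d\n",
                   count["H"], count["S"], count["U"]
        }
    '
}

# Hypothetical usage, not run here (option behavior is an assumption):
#   for m in test.d/msg.*; do
#       bogofilter -c ignore.cf -t < "$m"
#   done | tally_verdicts
```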

Here are the classification counts:

            Ham  Spam  Unsure

  original:  47    4   30 
  current:    2   67  208 
  ignore:     3  264   10 

Notes:
  original - counts based on original message classification (some 
     months ago).
  current  - counts based on classifications using current wordlist as
     #1.  Making the ignore list #2 effectively leaves it unused.
  ignore   - counts based on classifications using ignore list as
     #1.  With the current wordlist as #2, bogofilter ignores the 71
     tokens in the ignorelist.
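
To put the counts on a common footing, each row can be turned into a total
and per-class percentages.  This is a small sketch using the numbers from
the table above.  The function name class_shares and the input layout
("label: ham spam unsure") are chosen for illustration only.

```shell
# Sketch: print each row's total and the share of messages in each class.
# Input lines: "label: ham spam unsure".
class_shares() {
    awk '{
        total = $2 + $3 + $4
        if (total == 0) next   # skip empty rows to avoid dividing by zero
        printf "%s total=%d ham=%.1f%% spam=%.1f%% unsure=%.1f%%\n",
               $1, total, 100*$2/total, 100*$3/total, 100*$4/total
    }'
}

# Counts copied from the table above.
class_shares <<'EOF'
original: 47 4 30
current: 2 67 208
ignore: 3 264 10
EOF
```

On these counts the spam share is about 24% with the current wordlist
first and about 95% with the ignore list first.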

The code still needs some cleaning and organizing.  When that's done,
I'll commit it to cvs.

David


