How do I filter out spam that turns up on mailing lists?

David Relson relson at osagesoftware.com
Sat Jan 26 20:41:25 CET 2008


On Sat, 26 Jan 2008 19:14:54 +0100
Nigel Henry wrote:

> On Saturday 26 January 2008 01:16, David Relson wrote:
> > On Fri, 25 Jan 2008 20:22:45 +0100
> > Nigel Henry wrote:
> >
> > ...[snip]...
> >
> > > Hi David. Meanwhile back at the ranch, I'm not really on my way to
> > > creating this ignore.db. Not being one to give up (although a few
> > > days have passed), here's how things stand at present.
> > >
> > > I already had an /etc/bogofilter.cf.example file, but also created
> > > an /etc/bogofilter.cf file. I have added the following 2 lines to
> > > this newly created file.
> > >
> > > wordlist i,ignore,ignore.db,1
> > > wordlist r,word,wordlist.db,2
> >
> > good
> >
> > > Question 1:
> > > Do entries in /etc/bogofilter.cf override default settings
> > > in /etc/bogofilter.cf.example?
> >
> > bogofilter.cf.example is only an example.  It is not used by
> > bogofilter.
> >
> > > Next I created a file named ignore_list.txt, and put the full
> > > headers from one of my Debian list emails within.
> > >
> > > Now I ran the following command.
> > > [djmons at localhost djmons]$ bogoutil -l ~/.bogofilter/ignore.db <
> > > ignore_list.txt
> > > bogoutil: Unexpected input [ Received:] on line 2. Expecting
> > > whitespace before count.
> > > read or write error, aborting.
> > > [djmons at localhost djmons]$
> >
> > bogoutil expects lines containing 1 token, 2 counts, and a
> > timestamp. It isn't smart enough to parse real headers.
> >
> > You could use the following to parse and import in a single command:
> >
> >    bogolexer < message.headers | bogoutil -l ignore.db
> 
> Ok, but I'm still rather clueless here. Anyway, I've run the stuff
> below.
> 
> [djmons at localhost djmons]$ bogolexer < ignore_list.txt | bogoutil -l 
> ~/.bogofilter/ignore.db
> [djmons at localhost djmons]$ bogoutil -d .bogofilter/ignore.db
> 195 0 0 20080126
> get_token: 220 0 20080126
> normal 0 0 20080126
> [djmons at localhost djmons]$
> 
> The ignore_list.txt above is the full headers from a Debian mailing
> list email.
> 
> Does that output above look any better?

It looks better, though not quite right, because I forgot to include
the "-p" flag.  Use

   rm ignore.db
   bogolexer -p < message.headers | bogoutil -l ignore.db

The output of "bogoutil -d ignore.db" should then list each token from
the headers, and the counts should be "0 0 YYYYMMDD".
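
A rough sketch of how the dump could be checked automatically.  The
function name check_zero_counts is made up for illustration and is not
part of bogofilter; it assumes each dumped line is whitespace-separated
as token, count, count, date, as in the output quoted above.

```shell
# Hypothetical helper, not part of bogofilter.
# Reads "bogoutil -d" style lines on stdin and prints every line whose
# two counts (fields 2 and 3) are not both 0.  Exits non-zero if any
# such line was found.
check_zero_counts() {
    awk 'NF < 3 || $2 != 0 || $3 != 0 { print; bad = 1 }
         END { exit bad + 0 }'
}

# Intended use (assumes ignore.db is in the current directory):
#   bogoutil -d ignore.db | check_zero_counts && echo "all counts are 0 0"
```

With the dump quoted above, this would print the "get_token:" line,
since its first count is 220 rather than 0.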


> On an earlier post you said:
> Quote:
> What you _could_ do is create an ignore list with headers from the
> debian list.  This would eliminate those tokens from the scoring
> effectively telling bogofilter to score using only body tokens.
> 
> This is still what I'm looking for. It's not too easy to test the 
> effectiveness of the ignore.db at the moment, as I only get the odd
> spam email from the Debian list, but if I can get it to work, it will
> be a job well done.

Look at the FAQ for info on bogofilter's "-vvv" flag, which tells
bogofilter to display each token with its ham/spam counts and spamicity
score.  With a test message, save the "-vvv" results before and after
creating the ignore list, and then compare them.  You should see a
difference in the final score as well as in the header tokens.  The last
column of the "-vvv" output has a "+" for tokens used in scoring the
message, an "i" for tokens in the ignore list, and a "-" for tokens near
0.5 that are not used in scoring the message.
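
The comparison could be sketched roughly as follows.  The helper name
marker_counts and the file names before.txt, after.txt and test.msg are
made up for illustration; the helper relies only on the last
whitespace-separated column of a line being "+", "i" or "-", and
ignores every other line.

```shell
# Hypothetical helper, not part of bogofilter.
# Counts how many lines of a saved "-vvv" output end in "+", "i" or "-".
marker_counts() {
    awk '$NF == "+" || $NF == "i" || $NF == "-" { n[$NF]++ }
         END { printf "+ %d\ni %d\n- %d\n", n["+"], n["i"], n["-"] }' "$1"
}

# Intended use (assumes bogofilter reads the message on stdin):
#   bogofilter -vvv < test.msg > before.txt
#   ... create ignore.db and list it in bogofilter.cf ...
#   bogofilter -vvv < test.msg > after.txt
#   diff before.txt after.txt
#   marker_counts after.txt
```

If the ignore list is being used, the "i" count for after.txt should be
greater than zero, and diff should show the header tokens changing.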

HTH,

David


