preliminary ignore results
David Relson
relson at osagesoftware.com
Mon May 17 23:50:37 CEST 2004
Greetings,
A couple of the lists I subscribe to (or used to) get spammed regularly.
I've saved a number of the messages that bogofilter had trouble with.
The "trouble" was caused by the message's list headers (seen as hammish)
and the bodies (seen as spammish). The headers and bodies tend to
balance one another and the messages tend to score right around 0.5.
I've got a total of 280 such messages. From them I found all header
tokens that occurred 100 times or more. There were 71 of them. I used
them to create the ignorelist. The commands for doing this were
(roughly):
    bogofilter -d test -s test.d/msg.*
    bogoutil -d test/wordlist.db > test/wordlist.txt
    egrep -w '[12][0-9][0-9]' test/wordlist.txt | grep ":" | \
      egrep -v "^(rcvd:|head:)[0-9]" | \
      egrep -v "(mime:|subj:|relson|osagesoftware.com|rcvd:.*\-)" | \
      tee test/ignorelist.txt | wc -l
    bogoutil -l test/ignorelist.db < test/ignorelist.txt
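
The "100 times or more" selection above can also be written with an explicit numeric threshold instead of a digit pattern. This is only a rough sketch: it assumes each line of the dump is "token spamcount goodcount date", the sample file and its tokens are made up for illustration, and it does not repeat the rcvd:/mime:/subj: exclusions from the pipeline above.

```shell
# Hypothetical sample, assuming the dump format: token spamcount goodcount date
cat > wordlist.sample.txt <<'EOF'
head:mailman 250 0 20040517
from:list 120 0 20040517
viagra 180 0 20040517
to:example 99 0 20040517
subj:offer 140 0 20040517
EOF

# Keep tokens that carry a prefix (contain ":") and whose spam count
# (second field) is at least "min".
awk -v min=100 '$2 >= min && index($1, ":") > 0' \
    wordlist.sample.txt > ignorelist.sample.txt

# Number of tokens selected
wc -l < ignorelist.sample.txt
```

Changing min is easier than rewriting the [12][0-9][0-9] pattern when the message count grows past 299.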
The essential lines of the config files are:

current.cf:
    wordlist=r,wordlist,./test/wordlist.db,1
    wordlist=i,ignore,./test/ignorelist.db,2

ignore.cf:
    wordlist=i,ignore,./test/ignorelist.db,1
    wordlist=r,wordlist,./test/wordlist.db,2

Here are the classification counts:

               Ham   Spam   Unsure
    original:   47      4       30
    current:     2     67      208
    ignore:      3    264       10

Notes:

original - counts based on the original message classifications (some
  months ago).
current - counts based on classifications using the current wordlist as
  #1. Making the ignore list #2 effectively leaves it unused.
ignore - counts based on classifications using the ignore list as
  #1. With the current wordlist as #2, bogofilter ignores the 71
  tokens in the ignorelist.
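
To put the counts on a common footing, here is a small sketch that turns each row into percentages. It uses only the numbers from the table; the rows do not all add up to the same total, so each percentage is taken relative to its own row.

```shell
# Per-row totals and spam/unsure percentages for the counts above.
# Input columns: label ham spam unsure
summary=$(awk '
{
    total = $2 + $3 + $4
    printf "%-9s total=%d unsure=%.1f%% spam=%.1f%%\n", \
           $1, total, 100 * $4 / total, 100 * $3 / total
}' <<'EOF'
original: 47 4 30
current: 2 67 208
ignore: 3 264 10
EOF
)
printf '%s\n' "$summary"
```

With the ignore list first, the unsure share for these messages falls from roughly three quarters to under 4%.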
The code still needs some cleaning and organizing. When that's done,
I'll commit it to cvs.
David