preliminary ignore results
relson at osagesoftware.com
Mon May 17 17:50:37 EDT 2004
A couple of the lists I subscribe to (or used to) get spammed regularly.
I've saved a number of the messages that bogofilter had trouble with.
The "trouble" was caused by the messages' list headers (seen as hammish)
and the bodies (seen as spammish). The headers and bodies tend to
balance one another and the messages tend to score right around 0.5.
I've got a total of 280 such messages. From them I found all header
tokens that occurred 100 times or more. There were 71 of them. I used
them to create the ignorelist. The commands for doing this were:

    bogofilter -d test -s test.d/msg.*
    bogoutil -d test/wordlist.db > test/wordlist.txt
    egrep -w '[0-9][0-9]' test/wordlist.txt | grep ":" | \
        egrep -v "^(rcvd:|head:)[0-9]" | \
        egrep -v "(mime:|subj:|relson|osagesoftware.com|rcvd:.*\-)" | \
        tee test/ignorelist.txt | wc -l
    bogoutil -l test/ignorelist.db < test/ignorelist.txt

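
For readers who want the count threshold spelled out, here is a minimal
sketch of the selection step using awk. It is not the command used above.
It assumes each dump line has the form "token spam_count ham_count date",
and the function names and sample tokens are made up for illustration.

```shell
# Sketch only: keep header tokens (those containing ':') whose spam
# count is at least a given threshold.
# Assumed dump line format: token spam_count ham_count yyyymmdd
select_header_tokens() {
    # $1 = minimum spam count; the wordlist dump is read from stdin
    awk -v min="$1" 'index($1, ":") > 0 && $2 + 0 >= min + 0 { print }'
}

# Hypothetical sample data, not taken from the real wordlist.
sample_dump() {
    cat <<'EOF'
head:bogofilter 280 0 20040517
from:someone 12 0 20040517
cheap 250 0 20040517
to:example 104 0 20040517
EOF
}

# Prints the two header tokens with a count of 100 or more.
sample_dump | select_header_tokens 100
```

The output of such a filter could then be passed through the same
egrep -v exclusions before being loaded with bogoutil -l.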
The essential lines of the config files are:
Here are the classification counts:

               Ham   Spam   Unsure
    original:   47      4       30
    current:     2     67      208
    ignore:      3    264       10

original - counts based on original message classification (some
current - counts based on classifications using the current wordlist as
#1. Making the ignore list #2 effectively leaves it unused.
ignore - counts based on classifications using the ignore list as
#1. With the current wordlist as #2, bogofilter ignores the 71
tokens in the ignorelist.
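
Conceptually, the ignore list removes its tokens from consideration
before a message is scored. The sketch below illustrates that idea on a
plain list of tokens. It is not bogofilter's implementation; the function
name, file layout, and sample tokens are assumptions made for the example.

```shell
# Sketch only: drop every token that appears in an ignore file.
# Assumes the ignore file holds one entry per line, token in field 1.
drop_ignored() {
    # $1 = ignore file; message tokens are read from stdin, one per line
    awk 'FILENAME == ARGV[1] { ignore[$1] = 1; next }
         !($1 in ignore)     { print }' "$1" -
}

# Hypothetical usage with made-up tokens.
ignore_file=$(mktemp)
printf '%s\n' 'head:bogofilter 280 0 20040517' \
              'to:example 104 0 20040517' > "$ignore_file"
printf '%s\n' 'head:bogofilter' 'cheap' 'to:example' 'pills' |
    drop_ignored "$ignore_file"
rm -f "$ignore_file"
```

With the hammish list-header tokens filtered out, only the body tokens
remain to be scored, which fits the shift from Unsure to Spam in the
counts above.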
The code still needs some cleaning and organizing. When that's done,
I'll commit it to cvs.