Detecting false-positives
David Relson
relson at osagesoftware.com
Sun May 22 23:15:36 CEST 2005
On Sun, 22 May 2005 16:59:46 -0400
Dwayne Hottinger wrote:
> On a little different stroke, I've been getting plenty of the Nazi spam also.
> Some of which appears to have email addresses inside my domain. I have a setup
> where users report email as spam and the reported mail gets sent to a mail box.
> Once a week or more I go through these messages and run a little script that
> feeds them into bogofilter's wordlist. I've been hesitant to let the ones with
> email addresses inside my domain go into the wordlist for fear bogofilter will
> start grabbing those emails as spam also. Should I go ahead and dump those
> emails into bogo's wordlist with no fear that it will corrupt my wordlist?
>
> ddh
Hi Dwayne,
Yes. Include them as well.
Remember that each token has both a spam count and a ham count. The token's
score is (roughly)
spam / (spam + ham)
For any email address, the score for that address will depend on its
ratio of spam to ham. Here are some of _my_ numbers:
bogoutil -d wordlist.db | grep -w relson | awk '{print $1}' | bogoutil -p wordlist.db | sort -k4
from:david.relson 0 12070 0.000001
head:david.relson 0 12070 0.000001
rtrn:david.relson 0 12070 0.000001
to:david.relson 0 13 0.000711
david.relson 14 597 0.019109
...
from:relson.osagesoftware.com 1080 0 0.999992
head:relson.osagesoftware.com 1080 0 0.999992
rtrn:relson.osagesoftware.com 1080 0 0.999992
relson.com 1440 0 0.999994
relson.net 1440 0 0.999994
relson.org 1440 0 0.999994
head:bounce-#-relson 3463 0 0.999998
relson!osagesoftware.com 6480 0 0.999999
As you can see, most of the tokens have been seen only in ham or only in spam.
My name has a spam count of 14 and a ham count of 597. The spam count
moves its score away from 0.000001 (to 0.019109), but not very far.
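The rough ratio described above can be sketched in a few lines of Python. This is only an illustration of "spam / (spam + ham)", not bogofilter's code: the function name rough_score and the neutral 0.5 returned for a token with no counts are assumptions made for the example. The counts are copied from the listing above.

```python
def rough_score(spam: int, ham: int) -> float:
    """Rough token score: the fraction of a token's occurrences seen in spam."""
    total = spam + ham
    if total == 0:
        # Assumption for this sketch: a token never seen before is neutral.
        return 0.5
    return spam / total


# (token, spam count, ham count), taken from the listing above
counts = [
    ("from:david.relson", 0, 12070),
    ("david.relson", 14, 597),
    ("relson.com", 1440, 0),
]

for token, spam, ham in counts:
    print(f"{token:20s} {rough_score(spam, ham):.6f}")
```

For david.relson this gives about 0.023, close to but not the same as the 0.019109 in the listing, and the one-sided tokens come out at exactly 0 or 1 rather than 0.000001 or 0.999994. That gap is the "roughly": the scores bogoutil prints evidently involve more than the bare ratio, so treat the sketch as a way to see why a mostly-ham address stays near zero, not as a way to reproduce the exact numbers.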
HTH,
David