Detecting false-positives

David Relson relson at osagesoftware.com
Sun May 22 23:15:36 CEST 2005


On Sun, 22 May 2005 16:59:46 -0400
Dwayne Hottinger wrote:

> On a little different stroke, I've been getting plenty of the Nazi spam also. 
> Some of which appears to have email addresses inside my domain.  I have a setup
> where users report email as spam and the reported mail gets sent to a mail box.
> Once a week or more I go through these messages and run a little script that
> feeds them into bogofilter's wordlist.  I've been hesitant to let the ones with
> email addresses inside my domain go into the wordlist for fear bogofilter will
> start grabbing those emails as spam also.  Should I go ahead and dump those
> emails into bogo's wordlist with no fear that it will corrupt my wordlist?
> 
> ddh

Hi Dwayne,

Yes.  Include them as well.

Remember each token has both spam and ham counts.  The token's score is (roughly)

   spam / ( spam + ham)

For any email address, the score for that address will depend on its
ratio of spam to ham.  Here are some of _my_ numbers (columns: token,
spam count, ham count, score):

bogoutil -d wordlist.db | grep -w relson | awk '{print $1}' | bogoutil -p wordlist.db | sort -k4

from:david.relson                   0   12070  0.000001
head:david.relson                   0   12070  0.000001
rtrn:david.relson                   0   12070  0.000001
to:david.relson                     0      13  0.000711
david.relson                       14     597  0.019109
...
from:relson.osagesoftware.com    1080       0  0.999992
head:relson.osagesoftware.com    1080       0  0.999992
rtrn:relson.osagesoftware.com    1080       0  0.999992
relson.com                       1440       0  0.999994
relson.net                       1440       0  0.999994
relson.org                       1440       0  0.999994
head:bounce-#-relson             3463       0  0.999998
relson!osagesoftware.com         6480       0  0.999999

As you can see, most of the tokens appear solely in ham or solely in spam.
My name has a spam count of 14 and a ham count of 597.  The spam count
moves its score away from 0.000001, but not very far.
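
As a rough check, the raw ratio spam / (spam + ham) can be computed
directly from the counts in the listing.  The following is only a
minimal sketch, not bogofilter's actual scoring: raw_ratio is a
hypothetical helper written for this illustration, and the three rows
are copied from the listing above.

```shell
#!/bin/sh
# raw_ratio SPAM HAM
# Hypothetical helper: prints spam / (spam + ham) to six decimals,
# or "n/a" when the token has never been seen at all.
# This is the rough formula only, not what bogoutil -p computes.
raw_ratio() {
    awk -v spam="$1" -v ham="$2" 'BEGIN {
        total = spam + ham
        if (total == 0) { print "n/a"; exit }
        printf "%.6f\n", spam / total
    }'
}

# Rows copied from the listing above: token, spam count, ham count.
while read -r token spam ham; do
    printf "%-32s %s\n" "$token" "$(raw_ratio "$spam" "$ham")"
done <<'EOF'
from:david.relson 0 12070
david.relson 14 597
relson.com 1440 0
EOF
```

For david.relson this prints 0.022913, while bogoutil reports 0.019109;
the ham-only and spam-only tokens come out as exactly 0 and 1 rather
than 0.000001 and 0.999994.  The gap is why the formula is only
"roughly" the score: bogofilter appears to apply further adjustments on
top of the raw ratio, but the ordering of the tokens is the same.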

HTH,

David



