bogofilter resistant email

David Relson relson at osagesoftware.com
Thu Feb 12 13:31:53 CET 2004


On 12 Feb 2004 02:12:31 -0500
Tom Anderson wrote:

> The attached email came in as unsure with a spamicity of 0.493192.  I
> registered it as spam, and it increased to 0.500000.  I've registered
> it five more times, and the spamicity remains at 0.500000.  I think
> perhaps the sheer number of common hammish words is the culprit.  Has
> anyone else gotten impossible-to-filter emails like this?  It differs
> from the "random word" emails because most of the words in this email
> are common whereas the random words are usually unique.  I fear that
> registering this one too much will distort my database overall.  I
> wonder if giving more weight to the header tokens would be a good
> idea.
> 
> Tom

Tom,

With my wordlist and bogofilter's default parameters I get this result:

[relson at osage TomAnderson]$ bogofilter -vv <boss.eml
X-Bogosity: No, tests=bogofilter, spamicity=0.270366, version=0.17.1
   int  cnt   prob  spamicity histogram
  0.00   15 0.025736 0.005405 #########
  0.10   35 0.154833 0.049123 ####################
  0.20   62 0.249176 0.119907 ####################################
  0.30   84 0.354094 0.209903 ################################################
  0.40    0 0.000000 0.209903 
  0.50    0 0.000000 0.209903 
  0.60   52 0.645377 0.308902 ##############################
  0.70   48 0.750895 0.387440 ############################
  0.80   26 0.839908 0.427230 ###############
  0.90   26 0.969125 0.490234 ###############

Setting min_dev to my normal site value of 0.435, I get a result similar
to yours:

[relson at osage TomAnderson]$ bogofilter -vv < boss.eml -o0.501,0.4 -m0.435
X-Bogosity: Unsure, tests=bogofilter, spamicity=0.500000, version=0.17.1
   int  cnt   prob  spamicity histogram
  0.00   11 0.006303 0.002340 ###########
  0.10    0 0.000000 0.002340 
  0.20    0 0.000000 0.002340 
  0.30    0 0.000000 0.002340 
  0.40    0 0.000000 0.002340 
  0.50    0 0.000000 0.002340 
  0.60    0 0.000000 0.002340 
  0.70    0 0.000000 0.002340 
  0.80    0 0.000000 0.002340 
  0.90   23 0.977058 0.518776 #######################
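
The two runs differ in which tokens take part in the score. Here is a
minimal sketch in Python, assuming min_dev is a cutoff on how far a token's
probability must sit from 0.5, and assuming a default of 0.1 (the empty 0.40
and 0.50 rows in the first run fit that). The function name, the token
values, and the exact boundary handling are made up for illustration and are
not taken from bogofilter's source:

```python
def tokens_used(probs, min_dev):
    """Keep only tokens whose probability is at least min_dev away from 0.5."""
    return [p for p in probs if abs(p - 0.5) >= min_dev]

# Made-up token probabilities, one per populated interval in the first run.
probs = [0.02, 0.15, 0.25, 0.35, 0.65, 0.75, 0.84, 0.97]

print(tokens_used(probs, 0.1))    # every token here is at least 0.1 from 0.5
print(tokens_used(probs, 0.435))  # only the two extreme tokens remain
```

With the larger cutoff only the tokens in the 0.00 and 0.90 intervals are
left, which matches the shape of the second histogram.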

As pi suggests, run with "-vv" and "-vvv" and bogofilter will give you
more info about the important tokens in the message.

I've seen many instances of Unsure=0.500000.  Like this case, they often
have significantly different numbers of hammish and spammish tokens.
I've dug into the code and have a patch that displays additional detail
about the numbers used in Robinson's equations and in the Fisher
chi-square calculation.  I'll have to get out that patch and run it. 
Due to work pressures it may be a few days before I can get to it.
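
To illustrate how strong evidence on both sides can end up near 0.5, here is
a rough Python sketch of the Robinson-Fisher combination as it is usually
described. It is not bogofilter's actual code. The function names are made
up, and the token probabilities are the interval averages from the -m0.435
run above, used as stand-ins for the real per-token values:

```python
import math

def chi2_tail(x, df):
    """Upper-tail probability of a chi-square variable; df must be even."""
    m = x / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def spamicity(probs):
    """Combine token probabilities (each strictly between 0 and 1)."""
    df = 2 * len(probs)
    # Near 0 when hammish tokens make the product of p implausibly small.
    tail_p = chi2_tail(-2.0 * sum(math.log(p) for p in probs), df)
    # Near 0 when spammish tokens make the product of (1 - p) implausibly small.
    tail_1mp = chi2_tail(-2.0 * sum(math.log(1.0 - p) for p in probs), df)
    return (1.0 + tail_p - tail_1mp) / 2.0

# Stand-in values: 11 hammish tokens near 0.006, 23 spammish near 0.977.
mixed = [0.006303] * 11 + [0.977058] * 23
print(spamicity(mixed))
```

With these numbers both tail values come out very small, so they nearly
cancel and the result stays close to 0.5 even though the hammish and
spammish token counts are quite different.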

David





