bogofilter resistant email
David Relson
relson at osagesoftware.com
Thu Feb 12 13:31:53 CET 2004
On 12 Feb 2004 02:12:31 -0500
Tom Anderson wrote:
> The attached email came in as unsure with a spamicity of 0.493192. I
> registered it as spam, and it increased to 0.500000. I've registered
> it five more times, and the spamicity remains at 0.500000. I think
> perhaps the sheer number of common hammish words is the culprit. Has
> anyone else gotten impossible-to-filter emails like this? It differs
> from the "random word" emails because most of the words in this email
> are common whereas the random words are usually unique. I fear that
> registering this one too much will distort my database overall. I
> wonder if giving more weight to the header tokens would be a good
> idea.
>
> Tom
Tom,
With my wordlist and bogofilter's default parameters I get this result:
[relson at osage TomAnderson]$ bogofilter -vv <boss.eml
X-Bogosity: No, tests=bogofilter, spamicity=0.270366, version=0.17.1
int cnt prob spamicity histogram
0.00 15 0.025736 0.005405 #########
0.10 35 0.154833 0.049123 ####################
0.20 62 0.249176 0.119907 ####################################
0.30 84 0.354094 0.209903 ################################################
0.40 0 0.000000 0.209903
0.50 0 0.000000 0.209903
0.60 52 0.645377 0.308902 ##############################
0.70 48 0.750895 0.387440 ############################
0.80 26 0.839908 0.427230 ###############
0.90 26 0.969125 0.490234 ###############
Setting min_dev to my normal site value of 0.435, I get a result similar
to yours:
[relson at osage TomAnderson]$ bogofilter -vv < boss.eml -o0.501,0.4 -m0.435
X-Bogosity: Unsure, tests=bogofilter, spamicity=0.500000, version=0.17.1
int cnt prob spamicity histogram
0.00 11 0.006303 0.002340 ###########
0.10 0 0.000000 0.002340
0.20 0 0.000000 0.002340
0.30 0 0.000000 0.002340
0.40 0 0.000000 0.002340
0.50 0 0.000000 0.002340
0.60 0 0.000000 0.002340
0.70 0 0.000000 0.002340
0.80 0 0.000000 0.002340
0.90 23 0.977058 0.518776 #######################
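The empty 0.10 through 0.80 rows above are what min_dev does: tokens whose probability lies closer to 0.5 than min_dev are left out of the calculation. A minimal Python sketch of that filtering step follows. The function names and the sample probabilities are made up for illustration (they are not from boss.eml or from bogofilter's source), and whether the boundary comparison is ">=" or ">" is an assumption.

```python
def tokens_used(probs, min_dev):
    """Keep only token probabilities at least min_dev away from the neutral 0.5.

    Assumption: an inclusive comparison; bogofilter's exact boundary
    handling is not shown in this thread.
    """
    return [p for p in probs if abs(p - 0.5) >= min_dev]


def histogram_counts(probs, buckets=10):
    """Count tokens per 0.1-wide interval, like the 'cnt' column of -vv output."""
    counts = [0] * buckets
    for p in probs:
        idx = min(int(p * buckets), buckets - 1)  # p == 1.0 goes in the last bucket
        counts[idx] += 1
    return counts


# Hypothetical token probabilities, for illustration only.
sample = [0.01, 0.03, 0.15, 0.25, 0.35, 0.65, 0.75, 0.85, 0.95, 0.99]

kept = tokens_used(sample, 0.435)
print(kept)                    # [0.01, 0.03, 0.95, 0.99]
print(histogram_counts(kept))  # [2, 0, 0, 0, 0, 0, 0, 0, 0, 2]
```

With a min_dev this large, only the most extreme tokens survive, so the message is scored on a handful of very hammish and very spammish words.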
As pi suggests, run with "-vv" and "-vvv" and bogofilter will give you
more info about the important tokens in the message.
I've seen many instances of Unsure=0.500000. Like this case, they often
have significantly different numbers of hammish and spammish tokens.
I've dug into the code and have a patch that displays additional detail
about the numbers used in Robinson's equations and in the Fisher
chi-square calculation. I'll have to get out that patch and run it.
Due to work pressures it may be a few days before I can get to it.
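For anyone who wants to see why such messages land on exactly 0.500000, here is a rough Python sketch of the Robinson-Fisher combining step as it is usually published. It is not bogofilter's code, and the names chi2q and fisher_spamicity are my own. When a message has many strongly hammish and many strongly spammish tokens, both chi-square tail values are nearly zero, and (1 + 0 - 0) / 2 is 0.5.

```python
import math


def chi2q(x, df):
    """Upper-tail probability of a chi-square variable with even df.

    Uses the closed-form series for even degrees of freedom. Good enough
    for a demo; exp(-x/2) underflows to 0 for very large x.
    """
    assert df % 2 == 0
    m = x / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def fisher_spamicity(probs):
    """Combine per-token spam probabilities into one score between 0 and 1."""
    n = len(probs)
    ln_p = sum(math.log(p) for p in probs)        # log of the product of p
    ln_q = sum(math.log(1.0 - p) for p in probs)  # log of the product of (1 - p)
    p_tail = chi2q(-2.0 * ln_p, 2 * n)  # near 0 when strongly hammish tokens are present
    q_tail = chi2q(-2.0 * ln_q, 2 * n)  # near 0 when strongly spammish tokens are present
    return (1.0 + p_tail - q_tail) / 2.0


# Hypothetical token sets, for illustration only.
mixed = [0.01] * 20 + [0.99] * 20   # strong evidence in both directions
spammy = [0.99] * 10                # strong evidence in one direction

print("%.6f" % fisher_spamicity(mixed))   # 0.500000
print(fisher_spamicity(spammy) > 0.99)    # True
```

Registering the message as spam again nudges the individual token probabilities, but as long as the hammish tokens keep the first tail value near zero the combined score stays pinned at 0.5.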
David