bogofilter resistant email
Tom Anderson
tanderso at oac-design.com
Fri Feb 13 16:06:38 CET 2004
On Thu, 2004-02-12 at 07:31, David Relson wrote:
> With my wordlist and bogofilter's default parameters I get this result:
> X-Bogosity: No, tests=bogofilter, spamicity=0.270366, version=0.17.1
> Setting min_dev to my normal site value of 0.435, I get a result similar
> to yours:
> X-Bogosity: Unsure, tests=bogofilter, spamicity=0.500000, version=0.17.1
In both of these instances, you're confirming my observation that this
spam message will not be classified as spam. Here's my histogram:
[tanderso@www .bogofilter]$ bogofilter -vv <boss.eml
X-Bogosity: Unsure, tests=bogofilter, spamicity=0.500000, version=0.16.0
int   cnt  prob      spamicity  histogram
0.00   67  0.047401  0.021184   ##############################################
0.10   64  0.150372  0.065109   ############################################
0.20   43  0.228798  0.102889   ##############################
0.30    0  0.000000  0.102889
0.40    0  0.000000  0.102889
0.50    0  0.000000  0.102889
0.60    0  0.000000  0.102889
0.70    5  0.768982  0.127545   ####
0.80   12  0.836851  0.190418   #########
0.90   71  0.986458  0.499029   ################################################
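To make the 0.5 result less mysterious, here is a rough Python sketch of a Robinson-Fisher style chi-square combination applied to the histogram above. It is not bogofilter's actual code: the function names are made up for illustration, and each token is approximated by the average "prob" of its bucket, so the result is only indicative.

```python
import math

def chi2q(x2, dof):
    """Chi-square survival function for an even number of degrees of freedom."""
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def fisher_spamicity(probs):
    """Combine per-token probabilities into one score between 0 and 1."""
    n = len(probs)
    spam_stat = -2.0 * sum(math.log(1.0 - p) for p in probs)
    ham_stat = -2.0 * sum(math.log(p) for p in probs)
    s = 1.0 - chi2q(spam_stat, 2 * n)   # near 1 when spam evidence is strong
    h = 1.0 - chi2q(ham_stat, 2 * n)    # near 1 when ham evidence is strong
    return (1.0 + s - h) / 2.0

# (count, average prob) for the non-empty buckets of the histogram.
# Assumption: every token in a bucket sits at the bucket average.
buckets = [(67, 0.047401), (64, 0.150372), (43, 0.228798),
           (5, 0.768982), (12, 0.836851), (71, 0.986458)]
probs = [p for cnt, p in buckets for _ in range(cnt)]

score = fisher_spamicity(probs)
print(f"{len(probs)} tokens, combined score {score:.6f}")
```

With this many strongly hammish and strongly spammish tokens in the same message, both indicators saturate near 1 and cancel each other, so the combined score lands very close to 0.5.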
On Thu, 2004-02-12 at 07:37, Boris 'pi' Piwinger wrote:
> Just to add to that. You say, training does not change the
> bogosity. Do the following:
>
> bogofilter -vvv <boss.eml >before
> bogofilter -sv <boss.eml
> bogofilter -vvv <boss.eml >after
> diff before after
Ok, here is a small but representative subset:
[tanderso@www .bogofilter]$ diff before after |grep "the"
< "these" 14196 0.195133 0.059322 0.233135 +
< "there" 18137 0.170138 0.077400 0.312680 -
< "then" 15888 0.147775 0.067828 0.314599 -
< "their" 19260 0.178250 0.082242 0.315718 -
< "they" 19639 0.165972 0.084181 0.336519 -
< "the" 125539 0.870862 0.541979 0.383610 -
> "these" 14197 0.195133 0.059326 0.233148 +
> "there" 18138 0.170138 0.077404 0.312692 -
> "then" 15889 0.147775 0.067832 0.314612 -
> "their" 19261 0.178250 0.082246 0.315728 -
> "they" 19640 0.165972 0.084185 0.336529 -
> "the" 125540 0.870862 0.541981 0.383610 -
Because of the weight of the hammish tokens and the large number of them
in this spam, re-registering it many times does very little to the
overall spamicity.
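A small Python sketch of why one more registration barely moves a common word. It assumes the -vvv columns are count, good frequency, bad frequency and f(w), and that f(w) follows Robinson's smoothing formula. The constants s = 0.1 and x = 0.455 are picked only because they reproduce the numbers quoted in this message; they are not confirmed settings.

```python
# Hypothetical smoothing constants, chosen to match the quoted output.
ROBS = 0.1    # weight given to the prior
ROBX = 0.455  # prior value for a word never seen before

def fw(n, pgood, pbad, s=ROBS, x=ROBX):
    """Robinson's smoothed per-token score."""
    p = pbad / (pgood + pbad)         # raw spam share of the token
    return (s * x + n * p) / (s + n)  # pulled toward x when n is small

# (count, good freq, bad freq) before and after one "bogofilter -s" run,
# copied from the diff output in this message.
rows = {
    "the":      ((125539, 0.870862, 0.541979), (125540, 0.870862, 0.541981)),
    "lottorey": ((5, 0.000000, 0.000022), (6, 0.000000, 0.000027)),
}

for word, (before, after) in rows.items():
    b, a = fw(*before), fw(*after)
    print(f"{word:10s} {b:.6f} -> {a:.6f}  (change {a - b:+.6f})")
```

For a word with a count above 100000 the extra registration is lost in the rounding, while the rare spam-only words are already pinned near 1 and have almost no room left to move.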
Here's another subset that contains some new words too:
[tanderso@www .bogofilter]$ diff before after |grep "tt"
< "letters" 625 0.011182 0.002559 0.186280 +
< "getting" 11310 0.078930 0.048818 0.382144 -
< "attention" 1916 0.011840 0.008301 0.412166 -
< "letter" 5069 0.015567 0.022282 0.588714 -
< "lottery" 255 0.000439 0.001128 0.719962 -
< "Trotting" 5 0.000000 0.000022 0.989314 +
< "letter!" 5 0.000000 0.000022 0.989314 +
< "lotteries" 5 0.000000 0.000022 0.989314 +
< "lottorey" 5 0.000000 0.000022 0.989314 +
> "letters" 626 0.011182 0.002563 0.186543 +
> "getting" 11311 0.078930 0.048822 0.382164 -
> "attention" 1917 0.011840 0.008306 0.412295 -
> "letter" 5070 0.015567 0.022287 0.588761 -
> "lottery" 256 0.000439 0.001132 0.720756 -
> "Trotting" 6 0.000000 0.000027 0.991066 +
> "letter!" 6 0.000000 0.000027 0.991066 +
> "lotteries" 6 0.000000 0.000027 0.991066 +
> "lottorey" 6 0.000000 0.000027 0.991066 +
I don't think the spammishness of the low-count words can compete with
the weight of the hammish ones. And if I re-register enough times to make
the ham words more spammy, then I fear getting false positives.
Tom