bogofilter resistant email

Tom Anderson tanderso at oac-design.com
Fri Feb 13 16:06:38 CET 2004


On Thu, 2004-02-12 at 07:31, David Relson wrote:
> With my wordlist and bogofilter's default parameters I get this result:
> X-Bogosity: No, tests=bogofilter, spamicity=0.270366, version=0.17.1

> Setting min_dev to my normal site value of 0.435, I get a result similar
> to yours:
> X-Bogosity: Unsure, tests=bogofilter, spamicity=0.500000, version=0.17.1

In both of these instances, you're confirming my observation that this
spam message will not be classified as spam.  Here's my histogram:

[tanderso at www .bogofilter]$ bogofilter -vv <boss.eml
X-Bogosity: Unsure, tests=bogofilter, spamicity=0.500000, version=0.16.0
   int  cnt   prob  spamicity histogram
  0.00   67 0.047401 0.021184 ##############################################
  0.10   64 0.150372 0.065109 ############################################
  0.20   43 0.228798 0.102889 ##############################
  0.30    0 0.000000 0.102889 
  0.40    0 0.000000 0.102889 
  0.50    0 0.000000 0.102889 
  0.60    0 0.000000 0.102889 
  0.70    5 0.768982 0.127545 ####
  0.80   12 0.836851 0.190418 #########
  0.90   71 0.986458 0.499029 ################################################
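
To illustrate why a message with this kind of two-humped histogram lands on 0.5, here is a minimal sketch of Robinson-Fisher combining as described in Gary Robinson's write-ups. It is not bogofilter's actual code, and it assumes each bucket's "prob" column can stand in for every token in that bucket. The function names are made up for the example.

```python
import math

def chi2q(x2, df):
    """Chi-square survival function for an even number of degrees of freedom."""
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def combine(scores):
    """Robinson-Fisher style combination: 0 is hammy, 1 is spammy."""
    n = len(scores)
    # Near 1 when the scores are mostly spammy, near 0 when many are hammy.
    spam_ind = chi2q(-2.0 * sum(math.log(f) for f in scores), 2 * n)
    # Near 1 when the scores are mostly hammy, near 0 when many are spammy.
    ham_ind = chi2q(-2.0 * sum(math.log(1.0 - f) for f in scores), 2 * n)
    return (1.0 + spam_ind - ham_ind) / 2.0

# (count, average score) per non-empty bucket, copied from the histogram above.
# Assumption: the bucket average is used for every token in the bucket.
buckets = [(67, 0.047401), (64, 0.150372), (43, 0.228798),
           (5, 0.768982), (12, 0.836851), (71, 0.986458)]
scores = [prob for cnt, prob in buckets for _ in range(cnt)]

print(combine(scores))            # mixed message
print(combine([0.99] * 50))       # uniformly spammy message, for contrast
```

With strong evidence on both sides, both indicators collapse toward zero and cancel, so the combined value sits at the midpoint regardless of how extreme the spammy tokens are.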


On Thu, 2004-02-12 at 07:37, Boris 'pi' Piwinger wrote: 
> Just to add to that. You say, training does not change the
> bogosity. Do the following:
> 
> bogofilter -vvv <boss.eml >before
> bogofilter -sv  <boss.eml
> bogofilter -vvv <boss.eml >after
> diff before after

Ok, here is a small but representative subset:

[tanderso at www .bogofilter]$ diff before after |grep "the"
< "these"                          14196  0.195133  0.059322  0.233135 +
< "there"                          18137  0.170138  0.077400  0.312680 -
< "then"                           15888  0.147775  0.067828  0.314599 -
< "their"                          19260  0.178250  0.082242  0.315718 -
< "they"                           19639  0.165972  0.084181  0.336519 -
< "the"                            125539 0.870862  0.541979  0.383610 -
> "these"                          14197  0.195133  0.059326  0.233148 +
> "there"                          18138  0.170138  0.077404  0.312692 -
> "then"                           15889  0.147775  0.067832  0.314612 -
> "their"                          19261  0.178250  0.082246  0.315728 -
> "they"                           19640  0.165972  0.084185  0.336529 -
> "the"                            125540 0.870862  0.541981  0.383610 -

Because of the weight of the hammish tokens and the large number of them
in this spam, re-registering it many times does very little to the
overall spamicity.
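
The arithmetic behind this can be sketched as follows. This assumes the last three numeric columns in the -vvv output are pgood, pbad and Robinson's f(w), with f(w) = (s*x + n*p) / (s + n) and p = pbad / (pgood + pbad). The values of s and x below are placeholders, not my actual settings, so the low-count results will not match the listing exactly.

```python
def token_score(pgood, pbad, n, s=0.1, x=0.415):
    """Robinson's f(w): the raw ratio p blended with a prior x of weight s.

    s and x are assumed values for illustration only.
    """
    p = pbad / (pgood + pbad)
    return (s * x + n * p) / (s + n)

# (pgood, pbad, n) copied from the diff rows for "the" and "Trotting".
the_before = token_score(0.870862, 0.541979, 125539)
the_after = token_score(0.870862, 0.541981, 125540)
trot_before = token_score(0.0, 0.000022, 5)
trot_after = token_score(0.0, 0.000027, 6)

print("the:      %.6f -> %.6f" % (the_before, the_after))
print("Trotting: %.6f -> %.6f" % (trot_before, trot_after))
```

For a token seen over a hundred thousand times, one more registration is lost in the rounding, while a token seen five times visibly moves but was already close to 1.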

Here's another subset that contains some new words too:

[tanderso at www .bogofilter]$ diff before after |grep "tt"
< "letters"                          625  0.011182  0.002559  0.186280 +
< "getting"                        11310  0.078930  0.048818  0.382144 -
< "attention"                       1916  0.011840  0.008301  0.412166 -
< "letter"                          5069  0.015567  0.022282  0.588714 -
< "lottery"                          255  0.000439  0.001128  0.719962 -
< "Trotting"                           5  0.000000  0.000022  0.989314 +
< "letter!"                            5  0.000000  0.000022  0.989314 +
< "lotteries"                          5  0.000000  0.000022  0.989314 +
< "lottorey"                           5  0.000000  0.000022  0.989314 +
> "letters"                          626  0.011182  0.002563  0.186543 +
> "getting"                        11311  0.078930  0.048822  0.382164 -
> "attention"                       1917  0.011840  0.008306  0.412295 -
> "letter"                          5070  0.015567  0.022287  0.588761 -
> "lottery"                          256  0.000439  0.001132  0.720756 -
> "Trotting"                           6  0.000000  0.000027  0.991066 +
> "letter!"                            6  0.000000  0.000027  0.991066 +
> "lotteries"                          6  0.000000  0.000027  0.991066 +
> "lottorey"                           6  0.000000  0.000027  0.991066 +

I don't think the spammishness of the low-count words can compete with
the weight of the high-count hammish tokens.  And if I re-register the
message enough times to make the ham words look spammy, then I fear
getting false positives.
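
A rough way to see how many re-registrations that would take, assuming p = pbad / (pgood + pbad) and that each registration adds one to both the token's spam count and the spam message count. The corpus sizes in the example are hypothetical, chosen only to mimic the pgood/pbad ratio shown for "the"; they are not my real counts.

```python
def registrations_to_flip(good, good_msgs, bad, bad_msgs):
    """Smallest k such that (bad + k) / (bad_msgs + k) > good / good_msgs.

    Returns 0 if the token already leans spammy, None if it can never flip.
    """
    deficit = good * bad_msgs - bad * good_msgs
    if deficit < 0:
        return 0
    if good >= good_msgs:
        return None  # token is in every ham message; pbad can never exceed pgood
    return deficit // (good_msgs - good) + 1

# Hypothetical corpus: 1000 ham and 1000 spam messages, with the token in
# 870 hams and 542 spams (roughly the ratios shown for "the").
k = registrations_to_flip(good=870, good_msgs=1000, bad=542, bad_msgs=1000)
print(k)
```

In this made-up corpus the number of extra registrations needed exceeds the size of the whole spam collection, which is the kind of distortion that would push ordinary mail toward false positives.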

Tom
