Minimum usable counts [was: Question]

Stephen Davies scldad at sdc.com.au
Mon May 25 04:00:32 CEST 2009


I have been running with this patch for several days now and think that it may 
be a good idea.

However, its more immediate benefit has been to expose a more significant 
issue that has nothing directly to do with bogofilter but does seem 
relevant.

The way my filtering works is that I start with sendmail with an access db 
plus a modified rule set to reject unknown addressees. This reduces the spam 
volume considerably.
Mail that gets past those checks goes to amavisd via milter. Amavis uses 
clamav and bogofilter plus its usual bad-header etc. checks.

Finally, mail is delivered to me via kmail.

The gotcha that I have just discovered is that the mail as delivered by kmail 
is not identical to that checked by amavis/bogofilter.
So the false-ham text that I feed back to bogofilter -Ns is not identical to 
the original that went through bogofilter -n, with the obvious effects on 
training.

For example, the headers on one mail as seen by amavis/bogofilter were:

Received: from 189-19-129-170.dsl.telesp.net.br 
(189-19-129-170.dsl.telesp.net.br [189.19.129.170])
        by localhost (amavisd-milter);
        Sun, 24 May 2009 13:56:15 +0930 (CST)
        (envelope-from <tequilla09 at hotmail.com>)
Received: from 189.19.129.170 by mx1.hotmail.com; Sun, 24 May 2009 
01:25:53 -0300
Message-ID: <000d01c9dc27$b9f73130$6400a8c0 at tequilla09>
From: "Katie Bowling" <tequilla09 at hotmail.com>
To: <scldad at sdc.com.au>
Subject: $159.95 Viagra 100mg x 90 pills price
Date: Sun, 24 May 2009 01:25:53 -0300
MIME-Version: 1.0
Content-Type: multipart/alternative;
        boundary="----=_NextPart_000_0007_01C9DC27.B9F73130"
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 6.00.2800.1506
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2800.1506


The same headers as seen from kmail were:

From tequilla09 at hotmail.com Sun May 24 13:55:53 2009
Return-Path: <tequilla09 at hotmail.com>
X-Virus-Scanned: amavisd-new at sdc.com.au
Received: from 189-19-129-170.dsl.telesp.net.br 
(189-19-129-170.dsl.telesp.net.br [189.19.129.170])
        by mustang.sdc.com.au (8.14.3/8.14.2) with ESMTP id n4O4QEvT002192
        for <scldad at sdc.com.au>; Sun, 24 May 2009 13:56:15 +0930
Received: from 189.19.129.170 by mx1.hotmail.com; Sun, 24 May 2009 
01:25:53 -0300
Message-ID: <000d01c9dc27$b9f73130$6400a8c0 at tequilla09>
From: "Katie Bowling" <tequilla09 at hotmail.com>
To: <scldad at sdc.com.au>
Subject: $159.95 Viagra 100mg x 90 pills price
Date: Sun, 24 May 2009 01:25:53 -0300
MIME-Version: 1.0
Content-Type: multipart/alternative;
  boundary="----=_NextPart_000_0007_01C9DC27.B9F73130"
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 6.00.2800.1506
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2800.1506
X-UIDL: &[g!!PX8!!%T4"!FBm!!
Status: R
X-Status: NC
X-KMail-EncryptionState:
X-KMail-SignatureState:
X-KMail-MDN-Sent:

I have a number of entries in my ignore.db that reduce the differences but 
leave:

Received: from 189-19-129-170.dsl.telesp.net.br 
(189-19-129-170.dsl.telesp.net.br [189.19.129.170])
        Sun, 24 May 2009 13:56:15 +0930 (CST)
        (envelope-from <tequilla09 at hotmail.com>)

as unique to the kmail version.
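
A quick way to see which header fields differ between the two copies is to 
parse both and compare the field names. The following is only a minimal 
sketch using Python's standard email package; the function names and the 
sample messages are made up for illustration and are not part of amavis, 
KMail or bogofilter.

```python
from email import message_from_string


def header_names(raw: str) -> set:
    """Return the lower-cased header field names found in a raw message."""
    return {name.lower() for name in message_from_string(raw).keys()}


def added_headers(before: str, after: str) -> list:
    """Header names present in `after` but missing from `before`."""
    return sorted(header_names(after) - header_names(before))


# Hypothetical stand-ins for the copy seen by the filter and the copy
# saved from the mail client.
as_filtered = "Subject: x\nTo: a@example.org\n\nbody\n"
as_delivered = (
    "Return-Path: <b@example.org>\n"
    "Subject: x\n"
    "To: a@example.org\n"
    "X-UIDL: abc\n"
    "Status: R\n"
    "\n"
    "body\n"
)

print(added_headers(as_filtered, as_delivered))
```

Any names this reports are candidates for the ignore list, or a sign that 
the training copy should be taken from an earlier point in the delivery 
chain.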

Even more significant is the fact that most incoming mail contains CR 
characters (ex Microsoft). The text as examined by amavisd/bogofilter has 
these characters and therefore gives different tokens from those seen ex 
KMail, where the CRs have been stripped.

This is illustrated by the first line in the original analysis output.
The token "ere" actually comes from "h<CR>ere" in the original text.
In the KMail version, this has become "here".

This process skews the training and probably explains my original dilemma.

I am now going to try amending the amavis/bogofilter check to remove the CRs.
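
As a rough illustration of the CR effect and of the normalisation I have in 
mind, here is a minimal Python sketch. The tokenizer is a toy stand-in 
(runs of ASCII letters) and is not bogofilter's real lexer, and how the 
stripping would be wired into the amavis/bogofilter call is not shown; both 
function names are hypothetical.

```python
import re


def strip_cr(raw: bytes) -> bytes:
    """Remove every carriage return, so CRLF becomes LF and bare CRs vanish."""
    return raw.replace(b"\r", b"")


def toy_tokens(raw: bytes) -> list:
    """Toy tokenizer: runs of ASCII letters. NOT bogofilter's actual lexer."""
    return re.findall(rb"[A-Za-z]+", raw)


sample = b"click h\rere"

# With the CR in place the word is split in two...
print(toy_tokens(sample))
# ...and with the CR removed it is seen as a single token.
print(toy_tokens(strip_cr(sample)))
```

Whatever normalisation is used, it has to be applied identically on the 
scoring path and on the training path, otherwise the two token sets will 
still differ.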

The two sets of results are:

Original:

X-Bogosity: Ham, tests=bogofilter, spamicity=0.897798, version=1.2.0
                                        n    pgood     pbad      fw     U
  "ere"                               559  0.030850  0.000452  0.014712 +
  "rcvd:hotmail.com"                  462  0.014221  0.000828  0.055347 +
  "rcvd:tequilla09"                     1  0.000075  0.000000  0.126230 -
  "group"                            5938  0.069375  0.015219  0.179927 -
  "per"                             12917  0.102107  0.035075  0.255688 -
  "url:189.19"                         87  0.000602  0.000240  0.285840 -
  "ill"                              1317  0.007976  0.003674  0.315451 -
  "from:Bowling"                       62  0.000376  0.000173  0.316285 -
  "rcvd:Sun"                        24917  0.105643  0.071342  0.403097 -
  "rcvd:May"                        84596  0.348834  0.242610  0.410200 -
  "This"                           199836  0.776749  0.575008  0.425379 -
  "message"                        173495  0.672761  0.499279  0.425992 -
  "from:hotmail.com"                 5871  0.022498  0.016906  0.429055 -
  "rcvd:dsl.telesp.net.br"           1874  0.006998  0.005404  0.435767 -
  "rcvd:from"                      193482  0.697743  0.558915  0.444764 -
  "url:189"                          6327  0.022649  0.018284  0.446690 -
  "subj:$159.95"                      193  0.000677  0.000558  0.452156 -
  "head:Content-Type"              218242  0.629571  0.636790  0.502850 -
  "head:Date"                      231168  0.654778  0.674992  0.507601 -
  "subj:pills"                       7351  0.018284  0.021567  0.541186 -
  "mime:Content-Type"              142216  0.350941  0.417352  0.543220 -
  "mime:plain"                     138487  0.337773  0.406568  0.546213 -
  "mime:charset"                   138670  0.337773  0.407124  0.546551 -
  "mime:Content-Transfer-Encoding"  139977  0.338074  0.411077  0.548724 -
  "mime:quoted-printable"          121981  0.293454  0.358274  0.549730 -
  "head:multipart"                 144955  0.346953  0.425823  0.551031 -
  "mime:text"                      142951  0.340030  0.420022  0.552623 -
  "mime:html"                      138645  0.321294  0.407712  0.559271 -
  "nbsp"                            92425  0.212039  0.271880  0.561830 -
  "format"                         161387  0.350865  0.475522  0.575423 -
  "head:MIME-Version"              209188  0.452370  0.616464  0.576763 -
  "head:Message-ID"                204221  0.434312  0.602121  0.580956 -
  "head:X-MimeOLE"                 157176  0.325433  0.463771  0.587644 -
  "head:Produced"                  157752  0.326185  0.465488  0.587980 -
  "head:X-Mailer"                  188673  0.388713  0.556785  0.588880 -
  "head:Express"                   127124  0.261851  0.375153  0.588934 -
  "head:alternative"               120098  0.247028  0.354433  0.589287 -
  "head:Microsoft"                 159016  0.327013  0.469290  0.589336 -
  "head:MimeOLE"                   157135  0.319789  0.463874  0.591930 -
  "head:Normal"                    167929  0.334387  0.496036  0.597330 -
  "here"                            62721  0.123326  0.185331  0.600444 -
  "head:X-Priority"                171919  0.333785  0.508166  0.603558 -
  "rcvd:mx1.hotmail.com"              195  0.000376  0.000576  0.605209 -
  "subj:price"                       2930  0.005568  0.008665  0.608813 -
  "MIME"                           155520  0.292024  0.460094  0.611731 -
  "head:X-MSMail-Priority"         135589  0.254026  0.401152  0.612280 -
  "head:Outlook"                   148022  0.276674  0.437963  0.612847 -
  "mime:iso-8859-2"                  8792  0.015651  0.026045  0.624644 -
  "multi-part"                     152764  0.261625  0.452957  0.633877 -
  "pill"                             6816  0.010760  0.020247  0.652981 -
  "pills"                           14090  0.022197  0.041856  0.653458 -
  "Arial"                          114620  0.179609  0.340530  0.654691 -
  "groups.yahoo.com"                 1008  0.001505  0.002998  0.665782 -
  "face"                           139095  0.196464  0.414111  0.678232 -
  "to:sdc.com.au"                  342934  0.441610  1.022701  0.698418 -
  "rcvd:CST"                            0  --------  --------  0.700000 i
  "rcvd:amavisd-milter"                 0  --------  --------  0.700000 i
  "rcvd:envelope-from"                  0  --------  --------  0.700000 i
  "rcvd:localhost"                      0  --------  --------  0.700000 i
  "http"                           282873  0.357863  0.843845  0.702205 -
  "href"                           193966  0.244545  0.578658  0.702935 -
  "size"                           152427  0.190369  0.454808  0.704936 -
  "from:Katie"                         64  0.000075  0.000191  0.717485 -
  "Visit"                           19174  0.010685  0.057746  0.843858 -
  "head:V6.00.2800.1506"             1936  0.000828  0.005841  0.875859 +
  "to:scldad"                      163818  0.013168  0.496515  0.974164 +
  "$6.00"                               5  0.000000  0.000015  0.987356 +
  "from:tequilla09"                     5  0.000000  0.000015  0.987356 +
  "url:189.19.129"                      5  0.000000  0.000015  0.987356 +
  "url:189.19.129.170"                  5  0.000000  0.000015  0.987356 +
  "ganebawusexut64"                    10  0.000000  0.000030  0.993542 +
  "subj:Viagra"                     95457  0.000000  0.289630  0.999999 +
  N_P_Q_S_s_x_md                       10  0.000000  0.795596  0.897798
                                           0.220000  0.700000  0.375000

Ex KMail:

X-Bogosity: Spam, tests=bogofilter, spamicity=1.000000, version=1.2.0
                                        n    pgood     pbad      fw     U
  "group"                            5938  0.069349  0.015220  0.179993 -
  "per"                             12917  0.102068  0.035077  0.255773 -
  "url:189.19"                         87  0.000602  0.000240  0.285931 -
  "from:Bowling"                       62  0.000376  0.000173  0.316381 -
  "rcvd:Sun"                        24914  0.105679  0.071335  0.402993 -
  "rcvd:May"                        84577  0.349079  0.242554  0.409975 -
  "This"                           199817  0.776833  0.574977  0.425339 -
  "message"                        173476  0.672885  0.499243  0.425929 -
  "from:hotmail.com"                 5871  0.022490  0.016907  0.429165 -
  "rcvd:dsl.telesp.net.br"           1874  0.006995  0.005404  0.435877 -
  "rcvd:from"                      193463  0.697856  0.558883  0.444709 -
  "url:189"                          6327  0.022640  0.018285  0.446801 -
  "subj:$159.95"                      193  0.000677  0.000558  0.452267 -
  "head:Content-Type"              218223  0.629710  0.636763  0.502785 -
  "head:Date"                      231149  0.654908  0.674969  0.507543 -
  "subj:pills"                       7351  0.018278  0.021568  0.541297 -
  "mime:Content-Type"              142197  0.351185  0.417309  0.543022 -
  "mime:plain"                     138468  0.338022  0.406525  0.546004 -
  "mime:charset"                   138651  0.338022  0.407080  0.546342 -
  "mime:Content-Transfer-Encoding"  139958  0.338323  0.411034  0.548516 -
  "mime:quoted-printable"          121962  0.293719  0.358227  0.549473 -
  "head:multipart"                 144936  0.347198  0.425781  0.550831 -
  "mime:text"                      142932  0.340278  0.419979  0.552417 -
  "mime:html"                      138626  0.321549  0.407669  0.559049 -
  "nbsp"                            92409  0.212260  0.271839  0.561536 -
  "format"                         161372  0.351034  0.475499  0.575293 -
  "head:MIME-Version"              209169  0.452576  0.616436  0.576641 -
  "head:Message-ID"                204202  0.434524  0.602092  0.580825 -
  "head:X-MimeOLE"                 157161  0.325611  0.463747  0.587499 -
  "head:Produced"                  157737  0.326363  0.465464  0.587836 -
  "head:Express"                   127109  0.262053  0.375123  0.588727 -
  "head:X-Mailer"                  188658  0.388868  0.556768  0.588776 -
  "head:alternative"               120079  0.247311  0.354386  0.588977 -
  "head:Microsoft"                 159001  0.327191  0.469267  0.589193 -
  "head:MimeOLE"                   157120  0.319970  0.463850  0.591782 -
  "head:Normal"                    167914  0.334562  0.496014  0.597193 -
  "here"                            62717  0.123355  0.185329  0.600386 -
  "head:X-Priority"                171904  0.333960  0.508146  0.603423 -
  "rcvd:mx1.hotmail.com"              195  0.000376  0.000577  0.605316 -
  "subj:price"                       2930  0.005566  0.008666  0.608920 -
  "MIME"                           155505  0.292215  0.460069  0.611563 -
  "head:X-MSMail-Priority"         135574  0.254231  0.401124  0.612071 -
  "head:Outlook"                   148007  0.276871  0.437937  0.612664 -
  "mime:iso-8859-2"                  8792  0.015645  0.026047  0.624750 -
  "multi-part"                     152749  0.261828  0.452933  0.633685 -
  "pill"                             6816  0.010756  0.020248  0.653083 -
  "pills"                           14090  0.022189  0.041859  0.653559 -
  "Arial"                          114604  0.179842  0.340494  0.654374 -
  "groups.yahoo.com"                 1008  0.001504  0.002998  0.665882 -
  "face"                           139079  0.196690  0.414081  0.677964 -
  "to:sdc.com.au"                  342915  0.441820  1.022703  0.698318 -
  "head:From"                           0  --------  --------  0.700000 i
  "head:May"                            0  --------  --------  0.700000 i
  "head:Status"                         0  --------  --------  0.700000 i
  "head:X-KMail-EncryptionState"        0  --------  --------  0.700000 i
  "head:X-KMail-MDN-Sent"               0  --------  --------  0.700000 i
  "head:X-KMail-SignatureState"         0  --------  --------  0.700000 i
  "head:X-Status"                       0  --------  --------  0.700000 i
  "head:X-Virus-Scanned"                0  --------  --------  0.700000 i
  "head:amavisd-new"                    0  --------  --------  0.700000 i
  "head:sdc.com.au"                     0  --------  --------  0.700000 i
  "head:tequilla09"                     0  0.000000  0.000000  0.700000 -
  "rcvd:ESMTP"                          0  --------  --------  0.700000 i
  "rcvd:mustang.sdc.com.au"             0  --------  --------  0.700000 i
  "rcvd:scldad"                         0  --------  --------  0.700000 i
  "http"                           282858  0.358029  0.843849  0.702108 -
  "href"                           193951  0.244754  0.578643  0.702751 -
  "size"                           152411  0.190598  0.454780  0.704673 -
  "from:Katie"                         64  0.000075  0.000191  0.717576 -
  "head:hotmail.com"                  566  0.000527  0.001696  0.763097 -
  "Visit"                           19174  0.010681  0.057750  0.843918 -
  "head:V6.00.2800.1506"             1936  0.000827  0.005841  0.875908 +
  "rtrn:hotmail.com"                  881  0.000301  0.002661  0.898375 +
  "head:Sun"                         5056  0.000602  0.015317  0.962190 +
  "to:scldad"                      163818  0.013163  0.496551  0.974176 +
  "$6.00"                               5  0.000000  0.000015  0.987356 +
  "from:tequilla09"                     5  0.000000  0.000015  0.987356 +
  "head:FBm!!"                          5  0.000000  0.000015  0.987356 +
  "head:g!!PX8!!"                       5  0.000000  0.000015  0.987356 +
  "rtrn:tequilla09"                     5  0.000000  0.000015  0.987356 +
  "url:189.19.129"                      5  0.000000  0.000015  0.987356 +
  "url:189.19.129.170"                  5  0.000000  0.000015  0.987356 +
  "ganebawusexut64"                    10  0.000000  0.000030  0.993542 +
  "head:X-UIDL"                     52472  0.000000  0.159219  0.999999 +
  "subj:Viagra"                     95457  0.000000  0.289651  0.999999 +
  "rcvd:sdc.com.au"                106072  0.000000  0.321860  0.999999 +
  "rcvd:for"                       110646  0.000000  0.335740  0.999999 +
  "rcvd:with"                      139985  0.000000  0.424765  1.000000 +
  N_P_Q_S_s_x_md                       17  0.000000  1.000000  1.000000
                                           0.220000  0.700000  0.375000
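
For anyone wanting to check the fw column by hand: the figures above are 
consistent with Robinson's f(w) = (s*x + n*p) / (s + n), where 
p = pbad / (pgood + pbad), if the three numbers on the last line of each 
table are read as robs, robx and min_dev (0.22, 0.7, 0.375). That reading is 
my assumption, as is the ">=" in the min_dev test; the sketch below is not 
bogofilter's own code.

```python
def robinson_fw(n, pgood, pbad, robs=0.22, robx=0.7):
    """Robinson's f(w): pull the raw spam ratio p towards robx with weight robs."""
    total = pgood + pbad
    # With no evidence at all, fall back to the prior robx.
    p = pbad / total if total > 0 else robx
    return (robs * robx + n * p) / (robs + n)


def is_scored(fw, min_dev=0.375):
    """Assumed rule: a token counts only if fw is at least min_dev from 0.5."""
    return abs(fw - 0.5) >= min_dev


# Rows taken from the tables above: (token, n, pgood, pbad)
rows = [
    ("rcvd:tequilla09", 1, 0.000075, 0.000000),
    ("$6.00", 5, 0.000000, 0.000015),
    ("ganebawusexut64", 10, 0.000000, 0.000030),
]

for token, n, pgood, pbad in rows:
    fw = robinson_fw(n, pgood, pbad)
    print(f"{token:20s} fw={fw:.6f} scored={is_scored(fw)}")
```

This also shows why a token seen only five times can reach 0.987: with 
pgood at zero, p is 1 and only the small robs term holds fw back.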


 On Thursday 21 May 2009 12:46:05 David Relson wrote:
> On Thu, 21 May 2009 12:01:41 +0930
>
> Stephen Davies wrote:
> > I understand.
> >
> > My initial issue is with the obvious spams not being detected first
> > time round.
> > The first I see of them is in my inbox as ham - despite being so
> > obviously spam.
> >
> > If I save the email and run it through bogofilter -vvv, I get the
> > results I posted.
> >
> > I then use bogofilter -Ns to "fix" the database and this seems to
> > work - until the next spam with the same pattern but from a different
> > source arrives. (bogofilter -vvv at this stage gives bogosity of 1.0).
> >
> > I have changed my min-dev, robx and robs to 0.35, 0.7, 0.1 but first
> > indications are that this is not enough.
>
> ...[snip]...
>
> Hi Stephen,
>
> 'Tis an interesting idea to allow not scoring tokens whose spam and ham
> counts are low.  As an experiment, the attached patch for src/score.c
> will ignore tokens for which good_count+bad_count<3.  Give it a try and
> let me know what you think of it.
>
> Regards,
>
> David
>
> P.S.  If the patch works for you, we'll need a good name for the
> option.  Any suggestions?
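
The rule David describes (skip tokens for which good_count+bad_count<3) can 
be sketched as below. This is only an illustration in Python, not the actual 
patch to src/score.c; the function name, the token names and the counts are 
invented.

```python
MIN_COUNT = 3  # threshold taken from David's description


def usable(good_count: int, bad_count: int, min_count: int = MIN_COUNT) -> bool:
    """A token takes part in scoring only if it has been seen often enough."""
    return good_count + bad_count >= min_count


# Hypothetical (good_count, bad_count) pairs per token.
tokens = {
    "rare": (1, 0),
    "common": (40, 2),
    "borderline": (1, 2),
}

kept = [name for name, (good, bad) in tokens.items() if usable(good, bad)]
print(kept)
```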



-- 
=============================================================================
Stephen Davies Consulting P/L                             Voice: 08-8177 1595
Adelaide, South Australia.                                Fax  : 08-8177 0133
Computing & Network solutions.                            Mobile:040 304 0583
                                          VoIP:sip:1132210 at sip1.bbpglobal.com


