Minimum usable counts [was: Question]

David Relson relson at osagesoftware.com
Mon May 25 05:37:21 CEST 2009


On Mon, 25 May 2009 11:30:32 +0930
Stephen Davies wrote:

> I have been running with this patch for several days now and think
> that it may be a good idea.
> 
> However, its more immediate benefit has been to expose a more
> significant issue that is nothing immediately to do with bogofilter
> but does seem relevant.
> 
> The way my filtering works is that I start with sendmail with an
> access db plus a modified rule set to reject unknown addressees. This
> reduces the spam volume considerably.
> Mail that gets past those checks goes to amavisd via milter. Amavis
> uses clamav and bogofilter plus its usual bad header etc. checks.
> 
> Finally, mail is delivered to me via kmail.
> 
> The gotcha that I have just discovered is that the mail as delivered
> by kmail is not identical to that checked by amavis/bogofilter.
> So the false ham text that I feed back to bogofilter -Ns is not
> identical to the original that went through bogofilter -n with the
> obvious effects.
> 
> For example, the headers on one mail as seen by amavis/bogofilter
> were:
> 
> Received: from 189-19-129-170.dsl.telesp.net.br 
> (189-19-129-170.dsl.telesp.net.br [189.19.129.170])
>         by localhost (amavisd-milter);
>         Sun, 24 May 2009 13:56:15 +0930 (CST)
>         (envelope-from <tequilla09 at hotmail.com>)
> Received: from 189.19.129.170 by mx1.hotmail.com; Sun, 24 May 2009 
> 01:25:53 -0300
> Message-ID: <000d01c9dc27$b9f73130$6400a8c0 at tequilla09>
> From: "Katie Bowling" <tequilla09 at hotmail.com>
> To: <scldad at sdc.com.au>
> Subject: $159.95 Viagra 100mg x 90 pills price
> Date: Sun, 24 May 2009 01:25:53 -0300
> MIME-Version: 1.0
> Content-Type: multipart/alternative;
>         boundary="----=_NextPart_000_0007_01C9DC27.B9F73130"
> X-Priority: 3
> X-MSMail-Priority: Normal
> X-Mailer: Microsoft Outlook Express 6.00.2800.1506
> X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2800.1506
> 
> 
> The same headers as seen from kmail were:
> 
> From tequilla09 at hotmail.com Sun May 24 13:55:53 2009
> Return-Path: <tequilla09 at hotmail.com>
> X-Virus-Scanned: amavisd-new at sdc.com.au
> Received: from 189-19-129-170.dsl.telesp.net.br 
> (189-19-129-170.dsl.telesp.net.br [189.19.129.170])
>         by mustang.sdc.com.au (8.14.3/8.14.2) with ESMTP id n4O4QEvT002192
>         for <scldad at sdc.com.au>; Sun, 24 May 2009 13:56:15 +0930
> Received: from 189.19.129.170 by mx1.hotmail.com; Sun, 24 May 2009
> 01:25:53 -0300
> Message-ID: <000d01c9dc27$b9f73130$6400a8c0 at tequilla09>
> From: "Katie Bowling" <tequilla09 at hotmail.com>
> To: <scldad at sdc.com.au>
> Subject: $159.95 Viagra 100mg x 90 pills price
> Date: Sun, 24 May 2009 01:25:53 -0300
> MIME-Version: 1.0
> Content-Type: multipart/alternative;
>   boundary="----=_NextPart_000_0007_01C9DC27.B9F73130"
> X-Priority: 3
> X-MSMail-Priority: Normal
> X-Mailer: Microsoft Outlook Express 6.00.2800.1506
> X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2800.1506
> X-UIDL: &[g!!PX8!!%T4"!FBm!!
> Status: R
> X-Status: NC
> X-KMail-EncryptionState:
> X-KMail-SignatureState:
> X-KMail-MDN-Sent:
> 
> I have a number of entries in my ignore.db that reduce the
> differences but leave:
> 
> Received: from 189-19-129-170.dsl.telesp.net.br 
> (189-19-129-170.dsl.telesp.net.br [189.19.129.170])
>         Sun, 24 May 2009 13:56:15 +0930 (CST)
>         (envelope-from <tequilla09 at hotmail.com>)
> 
> as unique to the kmail version.
> 
> Even more significant is the fact that most incoming mail contains CR 
> characters (ex microsoft). The text as examined by amavisd/bogofilter
> has these characters and, therefore, gives different tokens from
> those seen ex KMail where the CRs have been stripped.
> 
> This is illustrated by the first line in the original analysis output.
> The token "ere" actually comes from "h<CR>ere" in the original text.
> In the KMail version, this has become "here".
> 
> This process skews the training and probably explains my original
> dilemma.
> 
> I am now going to try amending the amavis/bogofilter check to remove
> the CRs.
> 
> The two sets of results are:
> 
> Original:
> 
> X-Bogosity: Ham, tests=bogofilter, spamicity=0.897798, version=1.2.0

..[snip]...
> Ex KMail:
> 
> X-Bogosity: Spam, tests=bogofilter, spamicity=1.000000, version=1.2.0

..[snip]...

Stephen,

Very interesting!

Problem 1 - the rewriting of headers by amavis shouldn't matter a whole
lot.  The tokens introduced by amavis will AFAICT be in _all_ messages,
so their ham and spam counts and percentages will cause them to be
ignored.  However, that rewriting by amavis does cause a bit of info to
be lost which is sub-optimal.
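
To illustrate why a token that shows up in every message drops out of
the scoring, here is a rough Python sketch of a Robinson-style token
score.  It is not bogofilter's actual code; the constants ROBS, ROBX
and MIN_DEV and the message counts are assumed values picked for the
example.

```python
# Assumed illustrative values; check your own bogofilter configuration.
ROBS = 0.0178    # weight given to the prior
ROBX = 0.52      # prior spamicity for a token with no history
MIN_DEV = 0.375  # tokens scoring closer than this to 0.5 are ignored


def token_score(spam_hits, ham_hits, spam_msgs, ham_msgs):
    """Smoothed spamicity of one token from its spam and ham counts."""
    n = spam_hits + ham_hits
    if n == 0:
        return ROBX
    spam_frac = spam_hits / spam_msgs
    ham_frac = ham_hits / ham_msgs
    p = spam_frac / (spam_frac + ham_frac)
    # Blend the observed ratio with the prior, weighted by the evidence.
    return (ROBS * ROBX + n * p) / (ROBS + n)


def is_used(score):
    """True if the token is far enough from neutral to affect the result."""
    return abs(score - 0.5) >= MIN_DEV


# A token present in every message, ham or spam (e.g. a header tag
# added by the local mail path).
everywhere = token_score(1000, 1000, 1000, 1000)

# A token seen mostly in spam.
spammy = token_score(400, 2, 1000, 1000)

print(everywhere, is_used(everywhere))
print(spammy, is_used(spammy))
```

The first token lands almost exactly on 0.5 and is skipped; only the
second one contributes to the message score.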

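Problem 2 - the <CR> chars are an interesting problem.  Bogofilter, run
from amavis, sees "h<CR>ere" and splits the word at the <CR>, giving the
token "ere".  kmail strips the <CR>, so the copy you train on contains
"here" and the tokens no longer match.  One solution is to filter the
message and remove all <CR> chars before bogofilter processing, i.e.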

   cat message | tr -d "\r" | bogofilter ...
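
The effect of that filter on the tokens can be sketched in Python.
The tokenizer below is a toy stand-in, not bogofilter's real lexer:
splitting on every non-alphanumeric character and dropping fragments
shorter than MIN_TOKEN_LEN are both assumptions made for the example.

```python
import re

MIN_TOKEN_LEN = 3  # assumed cutoff; shorter fragments are discarded


def tokens(text):
    """Toy tokenizer: split on anything that is not a letter or digit."""
    parts = re.split(r"[^A-Za-z0-9]+", text)
    return [t for t in parts if len(t) >= MIN_TOKEN_LEN]


def strip_cr(text):
    # Same effect as piping the text through: tr -d "\r"
    return text.replace("\r", "")


raw = "click h\rere now"

print(tokens(raw))            # what the filter sees with the CR in place
print(tokens(strip_cr(raw)))  # what it sees once the CRs are removed
```

With the CR in place the word splits and only "ere" survives; after
stripping, the tokenizer sees "here", matching what kmail hands back
for training.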

A quick review of RFC 2822 - Internet Message Format
( http://www.faqs.org/rfcs/rfc2822.html ) finds the following:  

   - CR and LF MUST only occur together as CRLF; they MUST NOT appear
     independently in the body.

So, having "h<CR>ere" is invalid.
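
A quick way to see whether a message breaks that rule is to look for
a CR that is not followed by LF.  The Python sketch below is only an
illustration; the helper names and the sample message are made up.
Unlike the tr filter, it removes only the stray CRs and leaves proper
CRLF line endings alone.

```python
import re

# A CR that is not immediately followed by LF.
BARE_CR = re.compile(rb"\r(?!\n)")


def bare_cr_offsets(data):
    """Byte offsets of every CR that is not part of a CRLF pair."""
    return [m.start() for m in BARE_CR.finditer(data)]


def drop_bare_cr(data):
    """Remove stray CRs but keep CRLF line endings intact."""
    return BARE_CR.sub(b"", data)


# Made-up sample: valid CRLF line endings plus one stray CR in the body.
msg = b"Subject: test\r\n\r\nclick h\rere\r\n"

print(bare_cr_offsets(msg))
print(drop_bare_cr(msg))
```

This checks only for stray CRs; a bare LF would need a separate test.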

I'll need to think on whether changing bogofilter to handle this is a
good idea.  Likely some experimentation is in order to check for side
effects if <CR> handling is changed.

Ciao,

David


