Minimum usable counts [was: Question]

Stephen Davies scldad at sdc.com.au
Mon May 25 06:18:16 CEST 2009


I think I agree now that the amavis header rewrites are not a big deal but the 
CR thing certainly is.

However, I think you misread the detail here.
It is bogofilter that is seeing the CR and creating two tokens where the same 
text without the CR gives just one token.

Amavis presents:

<A href=3D"http://groups.yahoo.com/group/ganebawusexut64/message/1">Visit h=^M
ere</A></FONT></DIV>^M

while kmail presents:

<A href=3D"http://groups.yahoo.com/group/ganebawusexut64/message/1">Visit h=
ere</A></FONT></DIV>

It seems that bogofilter is smart enough to get "here" as a token from the 
kmail version but not from the original (Amavis) version, where the CR sits 
between the "=" and the line feed.
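
To illustrate the difference, here is a minimal Python sketch. It is not
bogofilter's lexer: join_soft_breaks_lf_only and tokens are hypothetical
helpers, and the sketch simply assumes a decoder that treats "=" as a
quoted-printable soft line break only when it is immediately followed by LF.

```python
import re

def join_soft_breaks_lf_only(text):
    # Assumption: only "=" directly followed by LF counts as a soft line break.
    return text.replace("=\n", "")

def tokens(text):
    # Hypothetical tokenizer: drop HTML tags, then keep runs of letters.
    no_tags = re.sub(r"<[^>]*>", " ", text)
    return re.findall(r"[A-Za-z]+", no_tags)

amavis_form = 'Visit h=\r\nere</A></FONT></DIV>\r\n'  # CR before each LF
kmail_form = 'Visit h=\nere</A></FONT></DIV>\n'       # LF only

print(tokens(join_soft_breaks_lf_only(kmail_form)))   # ['Visit', 'here']
print(tokens(join_soft_breaks_lf_only(amavis_form)))  # ['Visit', 'h', 'ere']
```

Under that assumption the CR hides the soft break, so "here" comes out as
the two tokens "h" and "ere".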

I have already added the pipe through tr to my amavis code but it is too early 
to tell what effect it has.
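
For anyone who would rather not shell out to tr, the same CR removal can be
sketched in Python. This is only an illustration of the idea, not my actual
amavis change; strip_cr is a hypothetical helper name and the two sample
byte strings are taken from the excerpts above.

```python
def strip_cr(raw: bytes) -> bytes:
    # Delete every CR byte (0x0D), which is what tr -d "\r" does.
    return raw.replace(b"\r", b"")

amavis_form = b'Visit h=\r\nere</A></FONT></DIV>\r\n'
kmail_form = b'Visit h=\nere</A></FONT></DIV>\n'

# After stripping, the Amavis text is byte-for-byte what kmail shows.
print(strip_cr(amavis_form) == kmail_form)  # True
```

Note that this removes every CR, including any stray one in the middle of a
line, exactly as tr does.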

Cheers,
Stephen

On Monday 25 May 2009 13:07:21 David Relson wrote:
> On Mon, 25 May 2009 11:30:32 +0930
>
> Stephen Davies wrote:
> > I have been running with this patch for several days now and think
> > that it may be a good idea.
> >
> > However, its more immediate benefit has been to expose a more
> > significant issue that is nothing immediately to do with bogofilter
> > but does seem relevant.
> >
> > The way my filtering works is that I start with sendmail with an
> > access db plus a modified rule set to reject unknown addressees. This
> > reduces the spam volume considerably.
> > Mail that gets past those checks goes to amavisd via milter. Amavis
> > uses clamav and bogofilter plus its usual bad-header etc. checks.
> >
> > Finally, mail is delivered to me via kmail.
> >
> ...[snip]...
>
> Stephen,
>
> Very interesting!
>
> Problem 1 - the rewriting of headers by amavis shouldn't matter a whole
> lot.  The tokens introduced by amavis will AFAICT be in _all_ messages,
> so their ham and spam counts and percentages will cause them to be
> ignored.  However, that rewriting by amavis does cause a bit of info to
> be lost which is sub-optimal.
>
> Problem 2 - kmail strips <CR> chars allowing a word to be split so that
> bogofilter misinterprets it -- an interesting problem.  One solution is
> to filter the message and remove all <CR> chars before bogofilter
> processing, i.e.
>
>    cat message | tr -d "\r" | bogofilter ...
>
> A quick review of RFC 2822 - Internet Message Format
> ( http://www.faqs.org/rfcs/rfc2822.html ) finds the following:
>
>    - CR and LF MUST only occur together as CRLF; they MUST NOT appear
>      independently in the body.
>
> So, having "h<CR>ere" is invalid.
>
> I'll need to think on whether changing bogofilter to handle this is a
> good idea.  Likely some experimentation is in order to check for side
> effects if <CR> handling is changed.
>
> Ciao,
>
> David



-- 
=============================================================================
Stephen Davies Consulting P/L                             Voice: 08-8177 1595
Adelaide, South Australia.                                Fax  : 08-8177 0133
Computing & Network solutions.                            Mobile:040 304 0583
                                          VoIP:sip:1132210 at sip1.bbpglobal.com


