Minimum usable counts [was: Question]

Stephen Davies scldad at sdc.com.au
Mon May 25 15:33:32 CEST 2009


Sorry David, I thought my example made it clear.

The actual texts were:

h=<CR><LF>ere

and 

h=<LF>ere

The first case is the raw text received by sendmail, amavis-milter and 
amavisd.

The second is the text presented by kmail.

In the second case, bogofilter is smart enough to get "here" as the token but 
in the first case, the CR broke the algorithm.

Presumably, the = at end-of-line is a quoted-printable soft line break, which
bogofilter knows how to join, but =<CR> is not recognised as one.
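
As a rough illustration of the difference between the two texts, here is a
small Python sketch. It assumes the = is a quoted-printable soft line break.
The names strip_soft_breaks and toy_tokens are made up for this example, and
the tokenizer is a stand-in, not bogofilter's actual lexer.

```python
import re

def strip_soft_breaks(text, accept_crlf):
    """Remove soft line breaks ("=" at end of line).

    accept_crlf=False models a decoder that only knows "=<LF>";
    accept_crlf=True also accepts "=<CR><LF>". Both are assumptions
    made for illustration, not a description of bogofilter's code.
    """
    pattern = r"=\r?\n" if accept_crlf else r"=\n"
    return re.sub(pattern, "", text)

def toy_tokens(text):
    """Toy tokenizer: split on anything that is not a letter or digit."""
    return [t for t in re.split(r"[^A-Za-z0-9]+", text) if t]

raw = "h=\r\nere"    # as received by sendmail, amavis-milter and amavisd
shown = "h=\nere"    # as presented by kmail

# A decoder that only knows "=<LF>" joins the kmail text ...
print(toy_tokens(strip_soft_breaks(shown, accept_crlf=False)))  # ['here']
# ... but leaves the raw text alone, so the CR splits the word.
print(toy_tokens(strip_soft_breaks(raw, accept_crlf=False)))    # ['h', 'ere']
# Accepting "=<CR><LF>" as well joins the raw text too.
print(toy_tokens(strip_soft_breaks(raw, accept_crlf=True)))     # ['here']
```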

HTH,
Stephen


On Monday 25 May 2009 22:35:13 David Relson wrote:
> On Mon, 25 May 2009 13:48:16 +0930
>
> Stephen Davies wrote:
> > I think I agree now that the amavis header rewrites are not a big
> > deal but the CR thing certainly is.
> >
> > However, I think you misread the detail here.
> > It is bogofilter that is seeing the CR and creating two tokens where
> > the same text without the CR gives just one token.
>
> Hi Stephen,
>
> The CR _is_ present.  Since tokens don't include control characters (or
> certain other characters), bogofilter is parsing "h<CR>ere" into "h"
> and "ere".  Bogofilter is doing exactly what was intended.
>
> RFC822 says that CR and LF must always appear as a CRLF pair.  One
> without the other makes the message non-compliant.  Bogofilter _does_
> accept some non-compliant constructs since spammers like them.  Whether
> lone <CR> characters should be discarded is presently an open question
> and needs investigation to see what would happen if bogofilter is
> changed.
>
> As a related thought, bogofilter has a "--min-token-len=X" parameter
> (with default value of 3).  A value of one will give "h" and "ere" as
> scorable tokens.  With message registration, bogofilter will quickly
> learn that "h" is spammish.  Try running with --min-token-len=1 and see
> how it goes.
>
> Regards,
>
> David
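
For anyone wanting to check their own mail for the two points David raises,
here is a hedged Python sketch. bare_cr_count and scorable are hypothetical
helpers written for this post. The length rule (keep tokens with at least
min_token_len characters) is an assumption about how the option behaves, not
something taken from the bogofilter source.

```python
import re

def bare_cr_count(text):
    """Count CR characters that are not immediately followed by LF."""
    return len(re.findall(r"\r(?!\n)", text))

def scorable(tokens, min_token_len=3):
    """Keep tokens of at least min_token_len characters (assumed rule)."""
    return [t for t in tokens if len(t) >= min_token_len]

print(bare_cr_count("h=\r\nere"))  # 0: the CR is part of a CRLF pair
print(bare_cr_count("h\rere"))     # 1: a lone CR

tokens = ["h", "ere"]
print(scorable(tokens))                   # ['ere']
print(scorable(tokens, min_token_len=1))  # ['h', 'ere']
```

With the default of 3 only "ere" would be scored; dropping the minimum to 1
lets "h" through as well, which is the experiment David suggests.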



-- 
=============================================================================
Stephen Davies Consulting P/L                             Voice: 08-8177 1595
Adelaide, South Australia.                                Fax  : 08-8177 0133
Computing & Network solutions.                            Mobile:040 304 0583
                                          VoIP:sip:1132210 at sip1.bbpglobal.com


