Minimum usable counts [was: Question]

David Relson relson at osagesoftware.com
Mon May 25 15:05:13 CEST 2009


On Mon, 25 May 2009 13:48:16 +0930
Stephen Davies wrote:

> I think I agree now that the amavis header rewrites are not a big
> deal but the CR thing certainly is.
> 
> However, I think you misread the detail here.
> It is bogofilter that is seeing the CR and creating two tokens where
> the same text without the CR gives just one token.

Hi Stephen,

The CR _is_ present.  Since tokens don't include control characters (or
certain other characters), bogofilter is parsing "h<CR>ere" into "h"
and "ere".  Bogofilter is doing exactly what was intended.
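To illustrate the split, here is a minimal Python sketch.  It is not
bogofilter's actual lexer; the delimiter set (ASCII control characters
and space) and the name split_tokens are assumptions made only for this
example.

```python
import re

# Hypothetical delimiter set: ASCII control characters (including CR),
# space, and DEL.  Bogofilter's real lexer has many more rules; this
# only shows why an embedded CR yields two tokens.
DELIMITERS = re.compile(r"[\x00-\x20\x7f]+")

def split_tokens(text):
    """Split text wherever a run of delimiter characters appears."""
    return [tok for tok in DELIMITERS.split(text) if tok]

print(split_tokens("h\rere"))  # ['h', 'ere']
print(split_tokens("here"))    # ['here']
```

The same text without the CR stays in one piece, which matches what you
are seeing.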

RFC 2822 says that CR and LF must always appear as a CRLF pair.  One
without the other makes the message non-compliant.  Bogofilter _does_
accept some non-compliant constructs since spammers like them.  Whether
lone <CR> characters should be discarded is presently an open question
and needs investigation to see what would happen if bogofilter is
changed.
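As a starting point for that investigation, a small Python sketch like
the following could count unpaired CR and LF bytes in a raw message.
The function name count_bare_line_endings is hypothetical and nothing
here is part of bogofilter.  Note that mail stored locally on Unix
usually has LF-only line endings, so the bare LF count is meaningful
only for messages kept in wire format.

```python
import re

# A CR that is not followed by LF, and an LF that is not preceded by CR.
BARE_CR = re.compile(rb"\r(?!\n)")
BARE_LF = re.compile(rb"(?<!\r)\n")

def count_bare_line_endings(raw):
    """Return (bare_cr_count, bare_lf_count) for a raw message in bytes."""
    return len(BARE_CR.findall(raw)), len(BARE_LF.findall(raw))

# One lone CR inside "h<CR>ere", followed by a proper CRLF line ending.
print(count_bare_line_endings(b"click h\rere\r\n"))  # (1, 0)
```

Running this over a spam and a ham corpus would show how common lone CR
characters are in each before any change is made.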

As a related thought, bogofilter has a "--min-token-len=X" parameter
(with default value of 3).  A value of 1 will give "h" and "ere" as
scorable tokens.  With message registration, bogofilter will quickly
learn that "h" is spammish.  Try running with a minimum token length of
1 and see how it goes.
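The effect of the minimum length can be sketched in Python as below.
This is an illustration only: the function scorable_tokens is
hypothetical, and it assumes the limit is inclusive (a token whose
length equals the minimum is kept).

```python
def scorable_tokens(tokens, min_token_len=3):
    """Keep tokens at least min_token_len characters long.

    Assumption: the limit is inclusive, so a 3-character token passes
    a minimum of 3.
    """
    return [tok for tok in tokens if len(tok) >= min_token_len]

tokens = ["h", "ere"]
print(scorable_tokens(tokens))                   # ['ere']
print(scorable_tokens(tokens, min_token_len=1))  # ['h', 'ere']
```

With the default, "h" is dropped before scoring; with a minimum of 1 it
is kept and can be learned like any other token.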

Regards,

David



