Minimum usable counts [was: Question]
David Relson
relson at osagesoftware.com
Mon May 25 15:05:13 CEST 2009
On Mon, 25 May 2009 13:48:16 +0930
Stephen Davies wrote:
> I think I agree now that the amavis header rewrites are not a big
> deal but the CR thing certainly is.
>
> However, I think you misread the detail here.
> It is bogofilter that is seeing the CR and creating two tokens where
> the same text without the CR gives just one token.
Hi Stephen,
The CR _is_ present. Since tokens don't include control characters (or
certain other characters), bogofilter is parsing "h<CR>ere" into "h"
and "ere". Bogofilter is doing exactly what was intended.
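To illustrate the splitting described above, here is a minimal sketch in Python. It is not bogofilter's lexer: it simply assumes that control characters and whitespace end a token. The name split_tokens is hypothetical.

```python
import re

# Assumption: a token ends at any control character or whitespace.
# bogofilter's real lexer applies many more rules than this.
SEPARATORS = re.compile(r"[\x00-\x20\x7f]+")

def split_tokens(text):
    """Split text at control characters/whitespace and drop empty pieces."""
    return [piece for piece in SEPARATORS.split(text) if piece]

print(split_tokens("h\rere"))  # ['h', 'ere']
print(split_tokens("here"))    # ['here']
```

The same text without the embedded CR stays a single token, which matches what Stephen observed.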
RFC 2822 says that CR and LF must only occur together as a CRLF pair. One
without the other makes the message non-compliant. Bogofilter _does_
accept some non-compliant constructs since spammers like them. Whether
lone <CR> characters should be discarded is presently an open question
and needs investigation to see what would happen if bogofilter is
changed.
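As a rough aid for that investigation, the following Python sketch counts lone CR and lone LF bytes in a raw message. The function name count_bare_line_endings is hypothetical, and the sketch assumes the message is available as bytes exactly as received.

```python
import re

# A CR not followed by LF, or an LF not preceded by CR, is "bare".
BARE_CR = re.compile(rb"\r(?!\n)")
BARE_LF = re.compile(rb"(?<!\r)\n")

def count_bare_line_endings(raw):
    """Return (bare_cr_count, bare_lf_count) for a raw message in bytes."""
    return len(BARE_CR.findall(raw)), len(BARE_LF.findall(raw))

print(count_bare_line_endings(b"h\rere\r\n"))  # (1, 0)
```

Note that mail stored on Unix systems often uses LF alone as the line ending, so a high bare-LF count there says nothing about the sender; only the bare-CR count is interesting in that case.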
As a related thought, bogofilter has a "--min-token-len=X" parameter
(with default value of 3). With the default, "h" is too short to be
scored. A value of 1 will give both "h" and "ere" as scorable tokens.
With message registration, bogofilter will quickly learn that "h" is
spammish. Try running with --min-token-len=1 and see how it goes.
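The effect of the minimum length can be sketched like this in Python. This is only an illustration, not bogofilter's code: scorable_tokens is a hypothetical helper, and it assumes the limit is inclusive (a token of exactly the minimum length is kept).

```python
def scorable_tokens(tokens, min_token_len=3):
    """Keep tokens whose length is at least min_token_len (assumed inclusive)."""
    return [tok for tok in tokens if len(tok) >= min_token_len]

tokens = ["h", "ere"]
print(scorable_tokens(tokens))                   # ['ere']
print(scorable_tokens(tokens, min_token_len=1))  # ['h', 'ere']
```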
Regards,
David