defining empty lines.

David Relson relson at osagesoftware.com
Sat May 17 22:31:56 CEST 2003


At 04:15 PM 5/17/03, Greg Louis wrote:

>On 20030517 (Sat) at 1551:25 -0400, David Relson wrote:
>
> > RFC2822 specifies "The body is simply a sequence of characters that
> > follows the header and is separated from the header by an empty line
> > (i.e., a line with nothing preceding the CRLF)."
> >
> > Jeremy Blosser has encountered many spam messages where "\b\r\n"
> > appears in this position.  Bogofilter is looking for a truly empty
> > line when writing out the "X-Bogosity" line (in passthrough mode) and
> > gets it wrong for these messages.
> >
> > It's easy enough to modify the code to treat any line consisting only
> > of whitespace characters as empty.
>
>Should be recognized and tokenized, maybe.  I don't suppose it happens
>very often in legitimate mail; might be a useful spam indicator!
>
>BTW I'm rerunning my tests of P options (ignore case, tag headers,
>process A IMG and FONT html tags) after correcting two human errors,
>and the results of the first 3 of 4 reruns suggest that ignoring case
>is not a good thing to do, tagging headers is a very good idea, and
>processing those html tag contents helps too.  A proper writeup will
>appear on my website in a day or two.

Greg,

Glad to hear of the preliminary results.  I'm pleased that they match up
with what I've seen.  'Tis good when different testers (with their
different corpora) come to the same conclusion.

Bogofilter could generate a token to indicate "non-blank empty line" or is
it "non-empty blank line"?  Perhaps "bogus empty line".  I'm thinking of
using a "spc:" prefix (meaning "special"), which would give
"spc:bogus_empty_line" (or some such name).
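
To make the idea concrete, here is a rough sketch in Python (not
bogofilter's actual C code) of how the header/body separator could be
located, how a whitespace-only separator could be flagged, and where the
passthrough header would go.  The function names, the treatment of only
spaces and tabs as whitespace, and the sample "X-Bogosity: Yes" value are
all assumptions made for illustration; only the token name comes from the
proposal above.

```python
# Token name as proposed in this message; everything else is hypothetical.
BOGUS_EMPTY_LINE_TOKEN = "spc:bogus_empty_line"


def find_header_end(lines):
    """Return (index, is_bogus) for the first header/body separator line.

    index is None when no separator is found.  is_bogus is True when the
    separator holds only spaces or tabs before the line ending instead of
    being truly empty, as RFC 2822 requires.
    """
    for i, line in enumerate(lines):
        content = line.rstrip("\r\n")
        if content == "":
            return i, False          # truly empty line
        if content.strip(" \t") == "":
            return i, True           # whitespace-only line, treated as empty
    return None, False


def add_bogosity_header(lines, header_line):
    """Insert header_line just before the separator (passthrough mode).

    Returns the new list of lines plus any special tokens to record.
    """
    index, is_bogus = find_header_end(lines)
    if index is None:
        # Header-only message: append the new header at the end.
        return list(lines) + [header_line], []
    tokens = [BOGUS_EMPTY_LINE_TOKEN] if is_bogus else []
    return lines[:index] + [header_line] + lines[index:], tokens


if __name__ == "__main__":
    message = ["Subject: hi\r\n", " \r\n", "body text\r\n"]
    new_lines, tokens = add_bogosity_header(message, "X-Bogosity: Yes\r\n")
    print(new_lines)
    print(tokens)
```

One caveat with this approach: a folded header continuation line that
contains nothing but whitespace would also be taken as the separator, so a
real implementation might want to be more careful there.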

David




