unescaped "From " lines

Sun Jan 26 18:54:33 CET 2003

At 12:26 PM 1/26/03, Gyepi SAM wrote:

>On Sun, Jan 26, 2003 at 09:51:36AM -0500, David Relson wrote:
> > Messages have headers and bodies.  "From " as a message separator only
> > occurs in the header.  If bogofilter knew whether it was in a header or a
> > body, life would be simple.  The first state change from header to body is
> > trivial - just the first empty line.
> >
> > When to change from body back to header is not trivial.
>
>In fact, it cannot be done correctly all of the time if we accept that 
>'=46rom ' in a qp encoded body
>decodes to 'From ' with no other contextual information.
>We could say that '^From ' inside an encoded body part is just a token and 
>does not mark a new message.
>This complicates things even more, but is doable.  What about '^From ' 
>inside a single message in a Maildir
>mailbox? Do we now have to know the mailbox format?
>
>I say we stop trying to decide and simply say that bogofilter deals with a 
>single message at a time.
>We pay a penalty when training on a mailbox, but IMO, that penalty is not 
>enough to justify the contortions
>we must go to handle an infrequent edge case. Most of the time, bogofilter 
>deals with single messages anyway.
>
>If it turns out that the penalty is unacceptable, I will write 
>bogotrain-mbx or whatever we call it.

Gyepi,

We could do that.  It would be very easy to do, but I hate to lose any 
features that are in bogofilter.   Besides, I think I've got the solution!

At a very low level, specifically in function lgetsl() in lexer.c, after 
getting a buffer of text, test whether it begins with "From ".  If it does, 
call function got_from(), which does a mime_reset, etc.  Most significantly 
(for the current conversation), it sets msg_header=true.

In the 3 lexer grammars, replace the call to got_from() with "return 
(msg_header ? FROM : TOKEN) ; }".

The first bit of code is executed prior to any decoding, so its effect is 
limited to plain-text occurrences of "From ".  The second bit returns a 
FROM token for "From " only while bogofilter is processing a message 
header.  Otherwise "From " is treated as an ordinary token.
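
A minimal, self-contained sketch of the two pieces described above. It is not the actual patch: on_raw_line(), lexer_sees_from(), classify_from() and count_messages() are hypothetical stand-ins for the check in lgetsl(), the grammar rule, and the message counter, and the "=46rom " test only imitates what a real quoted-printable decoder would produce. The only names taken from the description are msg_header, FROM and TOKEN.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token codes; the real grammars define their own. */
enum token_kind { TOKEN, FROM };

/* True while a message header is being read. */
static bool msg_header = false;

/* Stand-in for the check on the raw buffer, run before any decoding. */
static void on_raw_line(const char *line)
{
    assert(line != NULL);
    if (strncmp(line, "From ", 5) == 0)
        msg_header = true;      /* got_from() would also do mime_reset, etc. */
    else if (line[0] == '\0' || line[0] == '\n')
        msg_header = false;     /* the first empty line ends the header */
}

/* Stand-in for the lexer matching "From " at the start of a decoded line.
 * "=46rom " is the quoted-printable spelling that decodes to "From ". */
static bool lexer_sees_from(const char *line)
{
    return strncmp(line, "From ", 5) == 0 ||
           strncmp(line, "=46rom ", 7) == 0;
}

/* The grammar rule: FROM only inside a header, otherwise a plain token. */
static enum token_kind classify_from(void)
{
    return msg_header ? FROM : TOKEN;
}

/* Count message separators in an array of lines. */
static int count_messages(const char *const lines[], size_t n)
{
    int count = 0;
    msg_header = false;
    for (size_t i = 0; i < n; i++) {
        on_raw_line(lines[i]);
        if (lexer_sees_from(lines[i]) && classify_from() == FROM)
            count++;
    }
    return count;
}
```

With this arrangement, a raw "From " line both sets msg_header and is counted as a separator, while an encoded "=46rom " line in a body never touches msg_header and so comes out as an ordinary TOKEN.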

None of the regression tests are broken by the above changes.  As an 
additional test, I took Matt Armstrong's messages and created a mailbox 
with 2 copies of each of them.  "bogofilter -s" reports "1328 words, 4 
messages" for that mailbox - which is the correct message count.

I'll publish the patch in a few minutes, so people can look at and test the 
code.  First, however, I need to remove all the other debugging stuff I 
added to my source tree.

David