unescaped "From " lines
relson at osagesoftware.com
Sun Jan 26 12:54:33 EST 2003
At 12:26 PM 1/26/03, Gyepi SAM wrote:
>On Sun, Jan 26, 2003 at 09:51:36AM -0500, David Relson wrote:
> > Messages have headers and bodies. "From " as a message separator only
> > occurs in the header. If bogofilter knew whether it was in a header or a
> > body, life would be simple. The first state change from header to body is
> > trivial - just the first empty line.
> > When to change from body back to header is not trivial.
>In fact, it cannot be done correctly all of the time if we accept that
>'=46rom ' in a qp encoded body
>decodes to 'From ' with no other contextual information.
>We could say that '^From ' inside an encoded body part is just a token and
>does not mark a new message.
>This complicates things even more, but is doable. What about '^From '
>inside a single message in a Maildir
>mailbox? Do we now have to know the mailbox format?
>I say we stop trying to decide and simply say that bogofilter deals with a
>single message at a time.
>We pay a penalty when training on a mailbox, but IMO, that penalty is not
>enough to justify the contortions
>we must go to handle an infrequent edge case. Most of the time, bogofilter
>deals with single messages anyway.
>If it turns out that the penalty is unacceptable, I will write
>bogotrain-mbx or whatever we call it.
We could do that. It would be very easy to do, but I hate to lose any
features that are in bogofilter. Besides, I think I've got the solution!
At a very low level, specifically function lgetsl() in lexer.c, after
getting a buffer of text, test for "From ". If found, call function got_from()
which does a mime_reset, etc. Of most significance (to the current
conversation), it sets msg_header=true.
In the 3 lexer grammars, replace the call to got_from() with "return
(msg_header ? FROM : TOKEN) ; }".
The first bit of code is executed prior to any decoding, so its effect is
limited to plain text occurrences of "From ". The second bit will only
call it a FROM token when bogofilter is processing a message
header. Otherwise it's treated as an ordinary token.
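As a rough illustration of the two steps (a sketch only, not the actual patch): got_from(), msg_header, FROM and TOKEN are the names used above, while check_raw_line(), from_action(), lex_line() and the token_kind enum are hypothetical names made up for this example. The sketch assumes the separator is tested at the start of the raw line and that the first empty line ends the header.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical token kinds; the real grammars define many more. */
typedef enum { TOKEN, FROM } token_kind;

/* True while we are inside a message header. */
bool msg_header = false;

/* Stand-in for got_from(); the real one also does mime_reset, etc. */
void got_from(void)
{
    msg_header = true;
}

/* Step 1 (hypothetical helper): runs on the raw buffer, before any
 * decoding, so an encoded "=46rom " never matches here. */
void check_raw_line(const char *raw)
{
    if (strncmp(raw, "From ", 5) == 0)
        got_from();
    else if (raw[0] == '\n' || raw[0] == '\0')
        msg_header = false;   /* first empty line: header -> body */
}

/* Step 2: the action the grammars would use for a "From " match. */
token_kind from_action(void)
{
    return msg_header ? FROM : TOKEN;
}

/* Hypothetical driver: takes one line in raw and decoded form and
 * reports how a leading "From " in the decoded text is classified. */
token_kind lex_line(const char *raw, const char *decoded)
{
    check_raw_line(raw);
    if (strncmp(decoded, "From ", 5) == 0)
        return from_action();
    return TOKEN;
}
```

With this, a raw "From " line sets msg_header and yields FROM, while a body line that only decodes to "From " (for example "=46rom ") comes back as an ordinary TOKEN.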
None of the regression tests are broken by the above changes. As an
additional test, I took Matt Armstrong's messages and created a mailbox
with 2 copies of each of them. "bogofilter -s" reports "1328 words, 4
messages" for that mailbox - which is the correct message count.
I'll publish the patch in a few minutes, so people can look at and test the
code. First, however, I need to remove all the other debugging stuff I
added to my source tree.