unescaped "From " lines [was: results with latest beta]
relson at osagesoftware.com
Sun Jan 26 09:51:36 EST 2003
You do have an interesting spam corpus, don't you? Escaping "From" as
"=46rom" is new and different. Handling it correctly may not be a one or
two line change. Let me think out loud, I mean, describe the situation.
Messages have headers and bodies. "From " as a message separator only
occurs at the start of a header. If bogofilter knew whether it was in a header or a
body, life would be simple. The first state change from header to body is
trivial - just the first empty line.
When to change from body back to header is not trivial. A standard mailbox
with plain text, i.e. neither base64 nor qp, will escape all "From " lines
in message bodies. Bogofilter handles that just fine.
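
The header/body state tracking described above can be sketched roughly as follows. This is an illustration in Python, not bogofilter's actual C/flex code; the function name classify_lines is hypothetical, and it assumes a plain-text mbox whose body "From " lines are already escaped as ">From ".

```python
def classify_lines(lines):
    """Label each mbox line as 'separator', 'header', 'blank' or 'body'."""
    state = "body"  # anything before the first separator is treated as body
    labels = []
    for line in lines:
        if line.startswith("From "):
            # An unescaped "From " line starts a new message and its header.
            state = "header"
            labels.append("separator")
        elif state == "header" and line.strip() == "":
            # The first empty line ends the header and starts the body.
            state = "body"
            labels.append("blank")
        else:
            labels.append(state)
    return labels


mbox = [
    "From alice Sun Jan 26 09:51:36 2003",
    "Subject: hello",
    "",
    ">From the body, escaped, so not a separator",
    "From bob Sun Jan 26 10:00:00 2003",
    "Subject: second",
    "",
    "body text",
]
labels = classify_lines(mbox)
print(labels.count("separator"))  # number of messages found
```

With escaped bodies, counting the "separator" labels gives the message count; the trouble starts only when a body line can turn into "From " after decoding.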
Encoded message bodies are more difficult. It's easy enough to turn on
decoding, but when to turn it off is trickier. Consider a message that's a
header and an encoded body, say base64, and with no MIME parts. As of a
couple of days ago, bogofilter blithely decoded all lines until "From " was
encountered. A recent change was to test for "From " and skip decoding
that line. It seems like more changes are needed in that area (parsing).
At present, bogofilter has 3 lexer components. lexer_head.l knows about
header stuff, while lexer_text_plain.l deals with plain text and
lexer_text_html.l deals with html text. Some lexer rules appear in all
three and other rules appear only in one. In general, body text needs to
be decoded before the lexer rules are applied. This makes it possible for
bogofilter to see tokens in encoded text.
"From " is in all three lexer components. The text lexers use that rule to
shift out of body mode into header mode.
The problem here seems to be that text is read, decoded, and parsed (in
that order). The decoding of "=46rom" produces "From" which is recognized
by the parser.
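
A small demonstration of the ordering problem, using Python's standard quopri module rather than bogofilter's own decoder (the sample line is made up):

```python
import quopri

# A quoted-printable body line as it appears in the mailbox file.
raw = b"=46rom here on, prices are slashed"

# "=46" is the quoted-printable encoding of "F" (0x46).
decoded = quopri.decodestring(raw)

print(raw.startswith(b"From "))      # False: the raw line is not a separator
print(decoded.startswith(b"From "))  # True: the decoded line looks like one
```

Any "From " check that runs after decoding sees the second form and starts a new message.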
Maybe the answer is simple. Have only one check (rather than 3) for
"From " and have that check be in the routine that gets text for the
lexers. Probably it won't be quite that easy. Likely there will need to
be several checks, possibly complementary.
An alternate possibility is to check whether the text is plain or encoded
and only treat a plain "From " as the start of a new message.
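
One way to picture "check the plain text, then decode" is the sketch below. It is only an illustration under simplifying assumptions: the function name read_mbox is hypothetical, only quoted-printable is handled (base64 and MIME parts are ignored), and the encoding is taken from a single-line Content-Transfer-Encoding header.

```python
import quopri


def read_mbox(raw_lines):
    """Return a list of messages, each a list of decoded body lines."""
    messages = []
    in_header = False
    is_qp = False
    for raw in raw_lines:
        # Test the raw, undecoded line: only a plain "From " is a separator.
        if raw.startswith(b"From "):
            messages.append([])
            in_header = True
            is_qp = False
            continue
        if not messages:
            continue  # skip anything before the first separator
        if in_header:
            if raw.strip() == b"":
                in_header = False  # first empty line ends the header
            elif raw.lower().startswith(b"content-transfer-encoding:"):
                is_qp = b"quoted-printable" in raw.lower()
            continue
        # Body line: decode only after the separator check has been made.
        messages[-1].append(quopri.decodestring(raw) if is_qp else raw)
    return messages


mbox = [
    b"From spammer Sun Jan 26 09:51:36 2003",
    b"Content-Transfer-Encoding: quoted-printable",
    b"",
    b"=46rom now on, save money",
    b"From friend Sun Jan 26 10:00:00 2003",
    b"",
    b"plain body",
]
messages = read_mbox(mbox)
print(len(messages))
print(messages[0])
```

Here the "=46rom" line stays inside the first message's body, because the separator test never sees the decoded text.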
Anyhow, having thought about this, I'll do some experimenting and see what
I can come up with. I hope this info sheds some light on why it's so
tricky and complicated.
At 11:46 PM 1/25/03, Matt Armstrong wrote:
>Matt Armstrong <matt at lickey.com> writes:
> > For the first time bogofilter 0.10.x can parse my SPAM mailbox
> > without crashing -- yay! It gets the message count wrong (6916
> > messages -vs- the actual 6899 in one mbox, 9372 -vs- the actual 9362
> > in another).
>Some of the differences here were due to bogus unescaped "From " lines
>in message bodies.
>However, bogofilter still gets the count wrong. I didn't track down
>every message, but I did get two and saw a pattern.
>It seems that bogofilter does quoted-printable decoding before mbox
>"From " processing, so a message with a quoted-printable body
>containing a line beginning with "=46rom" will count as a new message