unescaped "From " lines [was: results with latest beta]

Sun Jan 26 15:51:36 CET 2003

Matt,

You do have an interesting spam corpus, don't you?  Escaping "From" as 
"=46rom" is new and different.  Handling it correctly may not be a one or 
two line change.  Let me think out loud, I mean, describe the situation.

Messages have headers and bodies.  "From " as a message separator only 
occurs in the header.  If bogofilter knew whether it was in a header or a 
body, life would be simple.  The first state change from header to body is 
trivial - just the first empty line.

When to change from body back to header is not trivial.  A standard mailbox 
with plain text, i.e. neither base64 nor qp, will escape all "From " lines 
in message bodies.  Bogofilter handles that just fine.
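
For illustration, the escaping mentioned above can be sketched in a few
lines of Python.  This is not bogofilter code; it assumes the common
">From " quoting convention used by mbox writers, and the function name
and sample lines are made up for the example.

```python
def escape_body_line(line: bytes) -> bytes:
    """Quote a body line so it cannot be mistaken for an mbox separator."""
    # Assumption: the writer uses the ">From " convention, i.e. it
    # prefixes ">" to any body line that begins with "From ".
    if line.startswith(b"From "):
        return b">" + line
    return line


# Hypothetical plain-text message body.
body = [b"Hello,\n", b"From now on we ship daily.\n", b"Bye\n"]
escaped = [escape_body_line(line) for line in body]
```

After this step no body line starts with "From ", so a reader can treat
every remaining "From " line as the start of a new message.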

Encoded message bodies are more difficult.  It's easy enough to turn on 
decoding, but when to turn it off is trickier.  Consider a message that's a 
header and an encoded body, say base64, and with no mime parts.  Until a 
couple of days ago, bogofilter blithely decoded all lines until "From " was 
encountered.  A recent change was to test for "From " and skip decoding 
that line.  It seems like more changes are needed in that area (parsing).

At present, bogofilter has 3 lexer components.  lexer_head.l knows about 
header stuff, while lexer_text_plain.l deals with plain text and 
lexer_text_html.l deals with html text.  Some lexer rules appear in all 
three and other rules appear only in one.  In general, body text needs to 
be decoded before the lexer rules are applied.  This makes it possible for 
bogofilter to see tokens in encoded text.

"From " is in all three lexer components.  The text lexers use that rule to 
shift out of body mode into header mode.

The problem here seems to be that text is read, decoded, and parsed (in 
that order).  The decoding of "=46rom" produces "From" which is recognized 
by the parser.
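
A small Python sketch may make the ordering problem concrete.  It is not
bogofilter code; it uses the standard quopri module, and the sample line
and the is_separator() helper are invented for the example.

```python
import quopri


def is_separator(line: bytes) -> bool:
    """Hypothetical stand-in for the lexers' "From " rule."""
    return line.startswith(b"From ")


# Made-up body line from a quoted-printable message; "=46" encodes "F".
raw_line = b"=46rom our family to yours\n"

# Decoding happens before the separator test, as described above.
decoded_line = quopri.decodestring(raw_line)

raw_is_separator = is_separator(raw_line)
decoded_is_separator = is_separator(decoded_line)
```

The raw line does not look like a separator, but the decoded line does,
which is how an extra message gets counted.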

Maybe the answer is simple.  Have only one check (rather than 3) for 
"From " and have that check be in the routine that gets text for the 
lexers.  Probably it won't be quite that easy.  Likely there will need to 
be several checks, possibly complementary.
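
As a rough sketch of the single-check idea, the Python below tests each
raw line for "From " before any decoding is done.  It is only an
illustration under simplifying assumptions, not bogofilter's
implementation: split_mbox() and the sample mailbox are hypothetical, and
a real reader would need more checks than this.

```python
import quopri


def split_mbox(lines):
    """Group mbox lines into messages, starting one at each "From " line."""
    messages = []
    current = None
    for line in lines:
        if line.startswith(b"From "):
            current = []
            messages.append(current)
        if current is not None:
            current.append(line)
    return messages


# Made-up mailbox: two messages, the first with a quoted-printable body.
sample = [
    b"From alice@example.com Sat Jan 25 23:46:00 2003\n",
    b"Content-Transfer-Encoding: quoted-printable\n",
    b"\n",
    b"=46rom now on, prices are lower=21\n",
    b"\n",
    b"From bob@example.com Sun Jan 26 15:51:36 2003\n",
    b"\n",
    b"Plain body\n",
]

# Check on raw text: the encoded line is not mistaken for a separator.
raw_count = len(split_mbox(sample))

# Decode first, then check: the decoded "From now on" line splits message 1.
decoded_first = [quopri.decodestring(line) for line in sample]
decoded_count = len(split_mbox(decoded_first))
```

With the check on raw lines the count is 2; decoding first gives 3, the
same kind of overcount reported for the spam mailbox.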

An alternate possibility is to check whether the text is plain or encoded 
and only treat a plain "From " as the start of a new message.

Anyhow, having thought about this, I'll do some experimenting and see what 
I can come up with.  I hope this info sheds some light on why it's so 
tricky and complicated.

David

At 11:46 PM 1/25/03, Matt Armstrong wrote:

>Matt Armstrong <matt at lickey.com> writes:
>
> > For the first time bogofilter 0.10.x can parse my SPAM mailbox
> > without crashing -- yay!  It gets the message count wrong (6916
> > messages -vs- the actual 6899 in one mbox, 9372 -vs- the actual 9362
> > in another).
>
>Some of the differences here were due to bogus unescaped "From " lines
>in message bodies.
>
>However, bogofilter still gets the count wrong.  I didn't track down
>every message, but I did get two and saw a pattern.
>
>It seems that bogofilter does quoted-printable decoding before mbox
>"From " processing, so a message with a quoted-printable body
>containing a line beginning with "=46rom" will count as a new message
>to bogofilter.
>
>Examples:
>
>     http://www.lickey.com/~matt/bogo/qp-from-1.msg
>     http://www.lickey.com/~matt/bogo/qp-from-2.msg
>