More strangeness in lexer?

Thu Aug 7 09:39:36 CEST 2003

David Relson <relson at osagesoftware.com> writes:
> At 08:38 PM 8/6/03, michael at optusnet.com.au wrote:
> >Why does the pattern
> >
> >^From\
> >
> >call 'is_from' which checks for the exact same match?
> >
> >And why does lexer.c:lgetsl() call 'is_from'?
[...]
> "^From " is the message separator, but only if it's properly
> capitalized.  The parser is operating in case insensitive mode, so
> is_from() is called to ensure that "From " was encountered (and other
> capitalizations were not).

Ahh. I'd forgotten the parser was case-insensitive. OK.
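
For anyone following along, the extra check amounts to something like
the sketch below. This is only an illustration: is_from_exact() is a
made-up name, and the real is_from() in bogofilter may well differ.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for is_from(): returns 1 only if the line
 * starts with exactly "From " (capital F, lower-case "rom", one space).
 * A case-insensitive lexer rule would also match "from " or "FROM ",
 * so this byte-for-byte comparison filters those out. */
int is_from_exact(const char *line, size_t len)
{
    static const char prefix[] = "From ";
    const size_t plen = sizeof(prefix) - 1;

    return len >= plen && memcmp(line, prefix, plen) == 0;
}
```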

> If I remember rightly, there are a lot of rules about how emails are
> supposed to end.  For example, the last mime body part is supposed to
> end with a boundary with trailing "--".  Unfortunately mailers often
> violate the rules and bogofilter needs to be tolerant of abnormal
> conditions.

In this case though, neither the MTA nor the original sender is at issue.
The only place that "\n\nFrom\ " is relevant is at the MDA. If your
local MDA is getting mbox wrong, then you probably need to change
MDAs :)

The escaping of '\n\nFrom\ ' in the mbox format is supposed to mean
that no matter what the original email was, or how badly broken
the sender, the MDA will produce a valid mbox.

The way bogofilter works at the moment, it's simply not sticking
to the proper mbox format.
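
To make the escaping concrete, here is a rough sketch of what an MDA
does to a body line before appending it to the mbox. The name
mbox_quote_line() is invented for this mail, and the sketch quotes
every body line starting with "From ", which is the simplest variant;
some MDAs quote only after a blank line, and others also re-quote
lines that already start with ">From ".

```c
#include <string.h>

/* Hypothetical helper: copy one body line (no trailing newline) into
 * out, prefixing '>' if the line starts with "From ".
 * Returns 0 on success, -1 if out is too small. */
int mbox_quote_line(const char *line, char *out, size_t outsz)
{
    int needs_quote = strncmp(line, "From ", 5) == 0;
    size_t need = strlen(line) + (needs_quote ? 1 : 0) + 1; /* +1 for NUL */

    if (need > outsz)
        return -1;
    if (needs_quote)
        *out++ = '>';
    strcpy(out, line);
    return 0;
}
```

After this step no body line can look like a separator, so a reader
never has to guess where a message starts.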

> Having lgetsl() call is_from() is done as one of these cases -- to
> detect the beginning of a new message (even though the old one isn't
> properly ended).

I suspect that you'll get it the other way around. You'll incorrectly
split a message when it's just the one email. Note that with
bogofilter 0.14.3 I get this:

$ wc -l l.spam.0
  24437 l.spam.0
$ bogofilter -s -v ...... -b < l.spam.0
# 356372 words, 24550 messages

which I find mildly entertaining. Where _did_ bogofilter
manage to find the extra 63 messages from?

> >The problem here is that the lexer is allowed to
> >call yy_get_new_line() during look-ahead. So if
> >the lexer looks ahead, token_init() can wind
> >up being called before the tokens in question
> >can be parsed. bad.
> 
> Is this a theoretical concern, or have you seen bogofilter encounter
> such a condition?  If the latter case, can you send me one of the
> problem messages?

Build a message and add

fred<a img="what-not
>From whym-wham
">

into the middle. (This is very artificial, but I just wanted
to illustrate.) Add a printf() to the token_init() call. Run
bogolexer:
[....]
get_token: 2 'C2815290`
get_token: 2 'C2F54C`
token_init!
get_token: 2 'fred`
get_token: 2 'img`
get_token: 2 'what-not`
get_token: 1 'From`
get_token: 2 'whym-wham`
get_token: 2 'wierd`

Note that token_init() is called before 'fred' is returned
to get_token()...

For even more amusement, add that fragment into the middle
of an mbox email and run mutt or pine over it. Note that they
don't break it into a separate message at that point. :)

> >It would be better to have token.c:get_token()
> >call token_init() when it sees a 'FROM' token
> >returned, yes?
> 
> No.  Using the suggested event order, the state variable "msg_header"
> wouldn't be true while processing the From line.  It's been a while
> since I closely examined that code, so I don't recall if the suggested
> event order breaks anything important.  As a quick test, I tend to
> make the change, run "make check", and look to see what gets broken.

Sorry, that's the whole point. :) You need to wait for the 'From '
token to appear before you can set 'msg_header' to true. The
current code is just broken that way. In the example above
the 'fred' token will be returned with 'msg_header' set
to true which is clearly out of order.
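
Roughly, the ordering I have in mind looks like the sketch below. All
the names are invented for illustration (this is not the real token.c),
and I am assuming that class 1 in the debug output above is the From
token and class 2 an ordinary word.

```c
#include <stdbool.h>

/* Hypothetical token classes, numbered after the debug output above. */
enum tok_class { TOK_FROM = 1, TOK_WORD = 2 };

struct parse_state {
    bool msg_header;   /* true while inside a message header */
    int  msg_count;    /* number of messages started so far */
};

/* Hypothetical equivalent of token_init(): start a new message. */
void start_message(struct parse_state *st)
{
    st->msg_header = true;
    st->msg_count++;
}

/* Called once per token, in the order tokens are handed to the caller.
 * The state is reset only when the From token itself arrives, so lexer
 * look-ahead cannot change the state seen by earlier tokens.
 * Returns the msg_header value that applies to this token. */
bool on_token(struct parse_state *st, enum tok_class cls)
{
    if (cls == TOK_FROM)
        start_message(st);
    return st->msg_header;
}
```

With this order 'fred' is still seen with the old state, and the From
token itself is already handled with msg_header set.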

> >I.e. "fred\nFrom hi" is NOT a separator, but
> >"fred\n\nFrom hi" is.
[...]

> See the above comment on abnormal message ends and the need to
> recognize the start of a new message.
> 
> With the current code in bogofilter, one can usually use "cat *spam* |
> bogofilter -s" to register a batch of spam messages.  Requiring a
> blank line between messages would make this fail.

No, because all mbox files end with a blank line. It's part of
the mbox format. It would still work just fine.
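
To spell the rule out, here is a small sketch that counts messages the
way I am suggesting: a "From " line counts only at the very start of
the input or right after an empty line. The function name is made up,
and it assumes the whole input sits in one buffer.

```c
#include <stddef.h>
#include <string.h>

/* Count mbox separators in buf: a line starting with "From " counts
 * only if it is the first line or is preceded by an empty line. */
size_t count_mbox_messages(const char *buf, size_t len)
{
    size_t count = 0;
    size_t i = 0;
    int prev_blank = 1;   /* start of input is treated like a blank line */

    while (i < len) {
        const char *eol = memchr(buf + i, '\n', len - i);
        size_t linelen = eol ? (size_t)(eol - (buf + i)) : len - i;

        if (prev_blank && linelen >= 5 && memcmp(buf + i, "From ", 5) == 0)
            count++;

        prev_blank = (linelen == 0);
        i += linelen + 1;   /* step past the newline as well */
    }
    return count;
}
```

Since each mbox file ends with a blank line, the first "From " line of
the next file in a "cat" still follows an empty line and is counted.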

Michael.