More strangeness in lexer?
David Relson
relson at osagesoftware.com
Thu Aug 7 05:10:36 CEST 2003
At 08:38 PM 8/6/03, michael at optusnet.com.au wrote:
>Why does the pattern
>
>^From\
>
>call 'is_from' which checks for the exact same match?
>
>And why does lexer.c:lgetsl() call 'is_from'?
There may be unnecessary calls to is_from(). I'll investigate.
"^From " is the message separator, but only if it's properly
capitalized. The parser is operating in case insensitive mode, so
is_from() is called to ensure that "From " was encountered (and other
capitalizations were not).
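
To illustrate the kind of check described above, here is a minimal sketch of an exact-case test. The name starts_with_from() and its signature are assumptions made for illustration; this is not bogofilter's actual is_from() code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not bogofilter's is_from(): returns true only when
 * the buffer begins with exactly "From " (capital F, lowercase "rom",
 * one trailing space).  memcmp() is case sensitive, so "from " and
 * "FROM " are rejected even if a case-insensitive pattern matched them. */
static bool starts_with_from(const char *buf, size_t len)
{
    static const char sep[] = "From ";
    const size_t sep_len = sizeof(sep) - 1;   /* 5, excluding the NUL */

    return len >= sep_len && memcmp(buf, sep, sep_len) == 0;
}
```

A case-insensitive lexer rule would accept "from fred" as well, so a follow-up check of this kind is one way to keep only the properly capitalized form.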
If I remember rightly, there are a lot of rules about how emails are
supposed to end. For example, the last MIME body part is supposed to end
with a boundary with trailing "--". Unfortunately mailers often violate
the rules and bogofilter needs to be tolerant of abnormal conditions.
Having lgetsl() call is_from() is done as one of these cases -- to detect
the beginning of a new message (even though the old one isn't properly ended).
>The problem here is that the lexer is allowed to
>call yy_get_new_line() during look-ahead. So if
>the lexer looks ahead, token_init() can wind
>up being called before the tokens in question
>can be parsed. bad.
Is this a theoretical concern, or have you seen bogofilter encounter such a
condition? If the latter, can you send me one of the problem messages?
>It would be better to have token.c:get_token()
>call token_init() when it sees a 'FROM' token
>returned, yes?
No. Using the suggested event order, the state variable "msg_header"
wouldn't be true while processing the From line. It's been a while since I
closely examined that code, so I don't recall if the suggested event order
breaks anything important. As a quick test, I tend to make the change, run
"make check", and look to see what gets broken.
>Note that either way, the pattern is broken.
>The '^From ' MUST follow a blank line for
>it to be a separator.
>
>I.e. "fred\nFrom hi" is NOT a separator, but
>"fred\n\nFrom hi" is.
>
>(More properly, emails stored in an mbox
>file must end with a blank line, but let's
>not split hairs. :)
>
>Michael.
See the above comment on abnormal message ends and the need to recognize
the start of a new message.
With the current code in bogofilter, one can usually use "cat *spam* |
bogofilter -s" to register a batch of spam messages. Requiring a blank
line between messages would make this fail.
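
The difference between the two rules can be sketched as follows. count_messages() and its require_blank flag are hypothetical names used only for this illustration; they are not part of bogofilter, and the sketch works on an in-memory string rather than on lexer input.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts message starts in an in-memory mailbox.
 * With require_blank false, every line beginning with "From " starts a
 * message.  With require_blank true, such a line counts only when it is
 * the first line of the input or directly follows an empty line. */
static size_t count_messages(const char *text, bool require_blank)
{
    size_t count = 0;
    bool prev_blank = true;      /* start of input acts as a boundary */
    const char *line = text;

    while (*line != '\0') {
        const char *eol = strchr(line, '\n');
        size_t len = eol ? (size_t)(eol - line) : strlen(line);

        if (len >= 5 && memcmp(line, "From ", 5) == 0 &&
            (!require_blank || prev_blank))
            count++;

        prev_blank = (len == 0); /* remember whether this line was empty */
        line = eol ? eol + 1 : line + len;
    }
    return count;
}
```

For two messages concatenated with no blank line between them, such as "From a\nSubject: x\n\nbody\nFrom b\nSubject: y\n\nbody\n", the lenient rule finds two message starts, while the blank-line rule finds only the first.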
Having responded to all your questions, it seems to me that the code is
doing what I want it to do and changes aren't needed. Let me know if I've
overlooked something or have something wrong. Also, experimentation is a
great tool for learning if it's OK to change the code, or if the change
breaks something.