More strangeness in lexer?

David Relson relson at osagesoftware.com
Thu Aug 7 05:10:36 CEST 2003


At 08:38 PM 8/6/03, michael at optusnet.com.au wrote:

>Why does the pattern
>
>^From\
>
>call 'is_from' which checks for the exact same match?
>
>And why does lexer.c:lgetsl() call 'is_from'?

There may be unnecessary calls to is_from().  I'll investigate.

"^From " is the message separator, but only if it's properly 
capitalized.  The parser is operating in case insensitive mode, so 
is_from() is called to ensure that "From " was encountered (and other 
capitalizations were not).
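
To illustrate the idea, here is a minimal sketch of such an exact-case check. It is not the actual is_from() from bogofilter; the name is_from_sketch() and its signature are assumptions made for this example only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, NOT bogofilter's real is_from().
 * Returns true only when the line begins with exactly "From ":
 * capital 'F', lowercase "rom", then a single space.  A case
 * insensitive lexer rule would also match "from " or "FROM ",
 * so a byte-wise comparison is used to reject those. */
static bool is_from_sketch(const char *line, size_t len)
{
    static const char sep[] = "From ";
    const size_t sep_len = sizeof sep - 1;   /* 5 bytes, NUL excluded */

    assert(line != NULL);
    return len >= sep_len && memcmp(line, sep, sep_len) == 0;
}
```

With this check, "From relson" is accepted, while "from relson", "FROM relson", and the header form "From: relson" are all rejected.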

If I remember rightly, there are a lot of rules about how emails are 
supposed to end.  For example, the last mime body part is supposed to end 
with a boundary with trailing "--".  Unfortunately mailers often violate 
the rules and bogofilter needs to be tolerant of abnormal conditions.

Having lgetsl() call is_from() handles one of these cases: it detects the 
beginning of a new message even when the old one isn't properly ended.

>The problem here is that the lexer is allowed to
>call yy_get_new_line() during look-ahead. So if
>the lexer looks ahead, token_init() can wind
>up being called before the tokens in question
>can be parsed. bad.

Is this a theoretical concern, or have you seen bogofilter encounter such a 
condition?  If the latter, can you send me one of the problem messages?

>It would be better to have token.c:get_token()
>call token_init() when it sees a 'FROM' token
>returned, yes?

No.  Using the suggested event order, the state variable "msg_header" 
wouldn't be true while processing the From line.  It's been a while since I 
closely examined that code, so I don't recall if the suggested event order 
breaks anything important.  As a quick test, I tend to make the change, run 
"make check", and look to see what gets broken.

>Note that either way, the pattern is broken.
>The '^From ' MUST follow a blank line for
>it to be a separator.
>
>I.e. "fred\nFrom hi" is NOT a separator, but
>"fred\n\nFrom hi" is.
>
>(More properly, emails stored in an mbox
>file must end with a blank line, but let's
>not split hairs. :)
>
>Michael.

See the above comment on abnormal message ends and the need to recognize 
the start of a new message.

With the current code in bogofilter, one can usually use "cat *spam* | 
bogofilter -s" to register a batch of spam messages.  Requiring a blank 
line between messages would make this fail.
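
The difference between the two rules can be shown with a small sketch. This is not bogofilter code; count_messages_sketch() and its "strict" flag are hypothetical names used only to compare the strict mbox rule (blank line required before "From ") with the tolerant one (any line starting with "From ").

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, NOT part of bogofilter.
 * Counts message starts in a NUL-terminated buffer.
 *   strict == true : "From " counts only at the start of the input
 *                    or directly after a blank line.
 *   strict == false: "From " counts at the start of any line. */
static size_t count_messages_sketch(const char *buf, bool strict)
{
    size_t count = 0;
    bool prev_blank = true;      /* start of input acts like a blank line */
    const char *line = buf;

    assert(buf != NULL);
    while (*line != '\0') {
        const char *eol = strchr(line, '\n');
        size_t len = eol ? (size_t)(eol - line) : strlen(line);

        if (len >= 5 && memcmp(line, "From ", 5) == 0 &&
            (!strict || prev_blank))
            count++;

        prev_blank = (len == 0);
        if (eol == NULL)
            break;               /* last line had no trailing newline */
        line = eol + 1;
    }
    return count;
}
```

For two concatenated messages where the first does not end with a blank line, such as "From a\n\nbody\nFrom b\n\nbody\n", the tolerant rule finds two messages while the strict rule finds only one, which is the failure described above for the "cat" case.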

Having responded to all your questions, it seems to me that the code is 
doing what I want it to do and changes aren't needed.  Let me know if I've 
overlooked something or have something wrong.  Also, experimentation is a 
great tool for learning if it's OK to change the code, or if the change 
breaks something.




