More strangeness in lexer?

Fri Aug 8 18:43:46 CEST 2003

Michael,

'Tis clear you've encountered a problem and I've yet to 
understand/reproduce it.  Sigh :-(

Question:  do you have modifications to bogofilter that fix the problem?

>David Relson <relson at osagesoftware.com> writes:
...[snip]...
> > If I remember rightly, there are a lot of rules about how emails are
> > supposed to end.  For example, the last mime body part is supposed to
> > end with a boundary with trailing "--".  Unfortunately mailers often
> > violate the rules and bogofilter needs to be tolerant of abnormal
> > conditions.
>
>In this case though, the MTA isn't at issue, nor is the original sender.
>The only place that "\n\nFrom\ " is relevant is at the MDA. If your
>local MDA is getting mbox wrong, then you probably need to change
>MDAs :)

Sounds like one approach would be a state counter that increments on empty 
lines and resets on tokens...
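
A minimal sketch of that idea, written in Python for brevity rather than in 
bogofilter's C.  The function name count_messages and its require_blank 
parameter are hypothetical and do not exist in bogofilter; the sketch only 
shows how a blank-line counter changes which "From " lines count as message 
starts.

```python
def count_messages(lines, require_blank=True):
    """Count mbox messages in an iterable of lines.

    A line starting with "From " begins a new message at the start of
    the input or, when require_blank is True, only directly after at
    least one empty line.
    """
    count = 0
    blank_run = None  # None means nothing has been read yet
    for line in lines:
        text = line.rstrip("\r\n")
        if text == "":
            # The counter increments on every empty line.
            blank_run = (blank_run or 0) + 1
            continue
        at_boundary = blank_run is None or blank_run > 0
        if text.startswith("From ") and (at_boundary or not require_blank):
            count += 1
        # Any non-empty line clears the counter.
        blank_run = 0
    return count


mbox = [
    "From a@example.com Fri Aug  8 18:43:46 2003\n",
    "Subject: one\n",
    "\n",
    "body\n",
    "From here on this is still body text\n",  # unescaped body line
    "\n",
    "From b@example.com Fri Aug  8 18:43:47 2003\n",
    "Subject: two\n",
    "\n",
    "text\n",
]

# Two single-message files joined with no blank line in between.
joined = [
    "From a@example.com Fri Aug  8 18:43:46 2003\n",
    "Subject: x\n",
    "\n",
    "body\n",
    "From b@example.com Fri Aug  8 18:43:47 2003\n",
    "Subject: y\n",
    "\n",
    "body2\n",
]

strict_count = count_messages(mbox, require_blank=True)
lenient_count = count_messages(mbox, require_blank=False)
```

The trade-off cuts both ways: the strict mode ignores the unescaped body 
line in mbox, but it also misses the second message in joined, where 
nothing separates the two messages.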

>The escaping of '\n\nFrom\ ' in the mbox format is supposed to mean
>that no matter what the original email was, or how badly broken
>the sending, the MDA will produce a valid mbox.
>
>The way bogofilter works at the moment, it's simply not sticking
>to the proper mbox format.
>
> > Having lgetsl() call is_from() is done as one of these cases -- to
> > detect the beginning of a new message (even though the old one isn't
> > properly ended).
>
>I suspect that you'll get it the other way around. You'll incorrectly
>split a message when it's just the one email. Noting that with
>bogofilter 0.14.3 I have this:
>
>$ wc -l l.spam.0
>   24437 l.spam.0
>$ bogofilter -s -v ...... -b < l.spam.0
># 356372 words, 24550 messages
>
>which I find mildly entertaining. Where _did_ bogofilter
>manage to find the extra 63 messages from?

The difference is 113 (24550 - 24437), and it is easily explained if 113 
messages have "From" at the start of a body line.

Bogofilter's parsing has certain expectations about the mbox files it 
reads.  In particular, it expects body lines beginning with "From " to be 
escaped, i.e. prefixed with ">".  When this is so, the meaning of 
"^From" is clear and bogofilter has no problems.
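
To make the expectation concrete, here is a rough sketch of that escaping 
in Python.  The helper names escape_from_lines and count_from_lines are 
made up for illustration, and the sketch assumes the first line of each 
message is its "From " separator; it is not code from bogofilter or from 
any MDA.

```python
def escape_from_lines(message_lines):
    """Prefix ">" to every line after the first that starts with "From ".

    The first line is assumed to be the mbox separator and is left alone.
    """
    escaped = []
    for index, line in enumerate(message_lines):
        if index > 0 and line.startswith("From "):
            line = ">" + line
        escaped.append(line)
    return escaped


def count_from_lines(lines):
    """Count lines that a naive "^From " test would treat as separators."""
    return sum(1 for line in lines if line.startswith("From "))


message = [
    "From a@example.com Fri Aug  8 18:43:46 2003\n",
    "Subject: s\n",
    "\n",
    "From whym-wham\n",  # body line that looks like a separator
]

raw_count = count_from_lines(message)
escaped_message = escape_from_lines(message)
escaped_count = count_from_lines(escaped_message)
```

With the body line unescaped, the naive test sees two separators in one 
message; after escaping it sees one.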

You use maildirs at your site, yes?  Do the message files escape From 
lines?  If not, this would explain the 113 "extra" messages.

If you care to isolate any of the messages, I'd be glad to look at 
them.  (Be sure to gzip 'em :-)

> > >The problem here is that the lexer is allowed to
> > >call yy_get_new_line() during look-ahead. So if
> > >the lexer looks ahead, token_init() can wind
> > >up being called before the tokens in question
> > >can be parsed. bad.
> >
> > Is this a theoretical concern, or have you seen bogofilter encounter
> > such a condition?  If the latter, can you send me one of the
> > problem messages?
>
>Build a message and add
> 
>
>fred<a img="what-not
> >From whym-wham
>">
>
>into the middle. (This is very artificial but I just wanted
>to illustrate). Add a printf() to the token_init() call. Run
>bogolexer
>[....]
>get_token: 2 'C2815290`
>get_token: 2 'C2F54C`
>token_init!
>get_token: 2 'fred`
>get_token: 2 'img`
>get_token: 2 'what-not`
>get_token: 1 'From`
>get_token: 2 'whym-wham`
>get_token: 2 'wierd`
>
>Note that token_init() is called before 'fred' is returned
>to get_token()...
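
The ordering described above can be modelled with a toy reader in Python.  
This is only an illustration of the look-ahead effect, not bogofilter's 
flex-generated lexer: tokenize_with_lookahead and the inner get_new_line 
are hypothetical stand-ins, and the assumption is that the reset fires as 
soon as a "From " line is read, not when its tokens are returned.

```python
def tokenize_with_lookahead(lines, events):
    """Emit word tokens while always reading one line ahead.

    Every action is appended to events so the order can be inspected.
    """
    def get_new_line(source):
        line = next(source, None)
        if line is not None and line.startswith("From "):
            # The reset fires when the line is read, which may be
            # before the previous line's tokens have been emitted.
            events.append("token_init!")
        return line

    source = iter(lines)
    current = get_new_line(source)
    while current is not None:
        upcoming = get_new_line(source)  # look-ahead read
        for word in current.split():
            events.append("token: " + word)
        current = upcoming
    return events


events = tokenize_with_lookahead(
    ["fred img what-not", "From whym-wham"], []
)
```

In this model the reset lands in events ahead of "token: fred", because 
the second line is fetched while the first line's tokens are still 
pending.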

To test this scenario, I created a message that has two copies of the "From 
whym-wham" line.  The first has ">" and the second does not.  I also added 
a printf("token_init!\n") statement.  I've attached my message as a .gz 
file, so that it won't be modified during transit.  My bogolexer output 
just has tokens of type 2 (a.k.a. TOKEN) and doesn't have type 1 
(FROM).  Results below:

[relson at osage src]$ bogolexer -v < msg.mo.0808.txt -C
normal mode.
token_init!
token_init!
get_token: 2 'subj:subject`
get_token: 2 'body`
get_token: 2 'fred`
get_token: 2 'img`
get_token: 2 'what-not`
get_token: 2 'From`
get_token: 2 'whym-wham`
get_token: 2 'wierd`
get_token: 2 'fred`
get_token: 2 'img`
get_token: 2 'what-not`
token_init!
get_token: 2 'whym-wham`
get_token: 2 'wierd`
14 tokens read.

Can you send me your test case as a .gz file?

...[snip]...

> > With the current code in bogofilter, one can usually use "cat *spam* |
> > bogofilter -s" to register a batch of spam messages.  Requiring a
> > blank line between messages would make this fail.
>
>No, because all mbox files end with a blank line. It's part of
>the mbox format. It would still work just fine.

Guess I wasn't clear :-(  My "cat *spam*" was referring to individual 
messages, not mailboxes.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: msg.mo.0808.txt.gz
Type: application/octet-stream
Size: 104 bytes
Desc: not available
URL: <https://www.bogofilter.org/pipermail/bogofilter-dev/attachments/20030808/efe736ac/attachment.obj>