More strangeness in lexer?

michael at optusnet.com.au michael at optusnet.com.au
Mon Aug 11 00:32:13 CEST 2003


David Relson <relson at osagesoftware.com> writes:
> Michael,
> 
> 'Tis clear you've encountered a problem and I've yet to
> understand/reproduce it.  Sigh :-(
> 
> Question:  have you modifications to bogofilter that fix the problem?

Not yet. :)
(Well, I do, but they're bound up with the word-pair scoring and
hinting work I've been doing).

[..] 
> >In this case though, the MTA isn't at issue, nor is the original sender.
> >The only place that "\n\nFrom\ " is relevant is at the MDA. If your
> >local MDA is getting mbox wrong, then you probably need to change
> >MDAs :)
> 
> Sounds like one approach would be a state flag that increments on
> empty lines and clears on tokens...

Actually, I think you can just change the '^From\ ' to be
'\n\nFrom\ ' in the lexer. The lexer is always looking for the
longest match anyway, so that's perfectly safe.... I think. :)
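
To make the difference between the two patterns concrete, here is a
rough sketch in Python rather than in the lexer itself. The sample
mailbox and both function names are made up for illustration, and the
start-of-input case is an assumption: the first message has no blank
line in front of it, so a real '\n\nFrom\ ' rule would presumably need
a similar special case.

```python
import re

# Hypothetical sample: two real messages. The body of the first one
# contains an unescaped line that happens to start with "From ".
MBOX = (
    "From alice@example.com Mon Aug 11 00:00:00 2003\n"
    "Subject: one\n"
    "\n"
    "Hello,\n"
    "From now on the build is green.\n"
    "\n"
    "From bob@example.com Mon Aug 11 00:01:00 2003\n"
    "Subject: two\n"
    "\n"
    "Bye.\n"
)

def count_line_anchored(text):
    # Like '^From\ ': every line starting with "From " counts as a separator.
    return len(re.findall(r"^From ", text, flags=re.MULTILINE))

def count_blank_line_anchored(text):
    # Like '\n\nFrom\ ': only "From " right after an empty line counts.
    # \A also accepts the very first line of the input (an assumption).
    return len(re.findall(r"(?:\A|\n\n)From ", text))

print(count_line_anchored(MBOX))        # counts the body line as well
print(count_blank_line_anchored(MBOX))  # counts only the real separators
```

With this sample the line-anchored count is one higher than the number
of real messages, which is the same kind of over-count as in the
l.spam.0 numbers quoted below.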

[...] 
> >$ wc -l l.spam.0
> >   24437 l.spam.0
> >$ bogofilter -s -v ...... -b < l.spam.0
> ># 356372 words, 24550 messages
> >
> >which I find mildly entertaining. Where _did_ bogofilter
> >manage to find the extra 113 messages from?
> 
> The difference of 113 is easily explained if 113 messages have "From"
> at the start of a body line.
> 
> Bogofilter's parsing has certain expectations about the mbox files it
> reads.  In particular, it expects message lines beginning with From to
> be escaped, i.e. add a ">" at the beginning.  When this is so, the
> meaning of "^From" is clear and bogofilter has no problems.

The problem here is that some MDAs will not escape '[^\n]\nFrom '
because it's not a separator. They'll only escape '\n\nFrom '...
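
A small sketch of the two escaping styles, again in Python and not
taken from any particular MDA. The function names and the sample body
are hypothetical. It only shows why a '^From ' reader miscounts a
mailbox written by an MDA that escapes nothing but '\n\nFrom '.

```python
import re

def escape_every_from(body):
    # Escape each body line that starts with "From ".
    return re.sub(r"^From ", ">From ", body, flags=re.MULTILINE)

def escape_after_blank_only(body):
    # Escape "From " only at the start of the body or right after an
    # empty line, i.e. only where it could be read as a separator.
    return re.sub(r"(\A|\n\n)From ", r"\1>From ", body)

# Hypothetical message body with both kinds of "From " line.
BODY = "Hi,\nFrom here it gets odd.\n\nFrom the top again.\n"

strict = escape_every_from(BODY)
lenient = escape_after_blank_only(BODY)

# A reader that treats '^From ' as the separator still sees one false
# separator in the leniently escaped body and none in the strict one.
print(len(re.findall(r"^From ", strict, flags=re.MULTILINE)))
print(len(re.findall(r"^From ", lenient, flags=re.MULTILINE)))
```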
 
> You use maildirs at your site, yes?  Do the message files escape From

yes.

> lines?  If not, this would explain the 113 "extra" messages.

No, the maildir format doesn't include escaping '\nFrom '.

[...] 


Michael.



