More strangeness in lexer?
michael at optusnet.com.au
Mon Aug 11 00:32:13 CEST 2003
David Relson <relson at osagesoftware.com> writes:
> Michael,
>
> 'Tis clear you've encountered a problem and I've yet to
> understand/reproduce it. Sigh :-(
>
> Question: have you modifications to bogofilter that fix the problem?
Not yet. :)
(Well, I do, but they're bound up with the word-pair scoring and
hinting work I've been doing).
[..]
> >In this case though, the MTA isn't at issue, nor is the original sender.
> >The only place that "\n\nFrom\ " is relevant is at the MDA. If your
> >local MDA is getting mbox wrong, then you probably need to change
> >MDAs :)
>
> Sounds like one approach would be a state flag that increments on
> empty lines and clears on tokens...
Actually, I think you can just change the '^From\ ' to be
'\n\nFrom\ ' in the lexer. The lexer is always looking for the
longest match anyway, so that's perfectly safe.... I think. :)
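
To make the two rules concrete, here is a rough sketch in Python (not
bogofilter's actual flex lexer; every name in it is made up for
illustration). It counts separators three ways: the current '^From '
rule, a '\n\nFrom ' rule, and a line-based flag along the lines David
describes. One assumption of mine: the start of the input is treated like
a blank line so the first message is still counted, which a bare
'\n\nFrom ' pattern would not do on its own.

```python
import re

# Hypothetical names for illustration; this is not bogofilter code.
NAIVE = re.compile(r"^From ", re.MULTILINE)   # any line starting with "From "
STRICT = re.compile(r"(?:\A|\n\n)From ")      # start of input, or after an empty line


def count_naive(text):
    """Count every line that begins with 'From '."""
    return len(NAIVE.findall(text))


def count_strict(text):
    """Count 'From ' lines at the start of input or directly after an empty line."""
    return len(STRICT.findall(text))


def count_with_flag(text):
    """Same rule as count_strict, using a 'previous line was empty' flag."""
    count = 0
    prev_blank = True  # assumption: start of input behaves like a blank line
    for line in text.split("\n"):
        if prev_blank and line.startswith("From "):
            count += 1
        prev_blank = (line == "")
    return count


SAMPLE = (
    "From alice Mon Aug 11 00:00:00 2003\n"
    "Subject: one\n"
    "\n"
    "Hello,\n"
    "From here on the body continues.\n"
    "\n"
    "From bob Mon Aug 11 00:01:00 2003\n"
    "Subject: two\n"
    "\n"
    "Bye.\n"
)

print(count_naive(SAMPLE), count_strict(SAMPLE), count_with_flag(SAMPLE))
```

On this sample the naive rule sees one separator more than the other two,
because of the unescaped body line. A body line starting with "From "
right after an empty line would still fool all three, which is why that
case has to be escaped by whatever writes the mbox.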
[...]
> >$ wc -l l.spam.0
> > 24437 l.spam.0
> >$ bogofilter -s -v ...... -b < l.spam.0
> ># 356372 words, 24550 messages
> >
> >which I find mildly entertaining. Where _did_ bogofilter
> >manage to find the extra 113 messages from?
>
> The difference of 113 is easily explained if 113 messages have "From"
> at the start of a body line.
>
> Bogofilter's parsing has certain expectations about the mbox files it
> reads. In particular, it expects message lines beginning with From to
> be escaped, i.e. add a ">" at the beginning. When this is so, the
> meaning of "^From" is clear and bogofilter has no problems.
The problem here is that some MDAs will not escape '[^\n]\nFrom '
because it's not a separator. They'll only escape '\n\nFrom '...
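
A small sketch of the two escaping policies, again in Python and with
made-up function names (this is not taken from any particular MDA). It
assumes the body begins right after the empty line that ends the headers,
so a "From " on the first body line counts as following a blank line.

```python
def escape_all(body):
    """Prefix '>' to every body line that starts with 'From '."""
    return "\n".join(
        ">" + line if line.startswith("From ") else line
        for line in body.split("\n")
    )


def escape_after_blank_only(body):
    """Prefix '>' only to 'From ' lines that directly follow an empty line."""
    out = []
    prev_blank = True  # assumption: the body starts just after the header/body blank line
    for line in body.split("\n"):
        if prev_blank and line.startswith("From "):
            line = ">" + line
        out.append(line)
        prev_blank = (line == "")
    return "\n".join(out)


BODY = "Hi,\nFrom now on I use maildir.\n\nFrom time to time it matters.\n"

print(escape_all(BODY))
print(escape_after_blank_only(BODY))
```

With the second policy the line "From now on I use maildir." is written
out unescaped, so a reader that splits on '^From ' alone will count it as
the start of a new message.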
> You use maildirs at your site, yes? Do the message files escape From
yes.
> lines? If not, this would explain the 113 "extra" messages.
No, the maildir format doesn't include escaping '\nFrom '.
[...]
Michael.
More information about the bogofilter-dev mailing list