Parsing of certain MIME messages, e.g. Vonage
David Relson
relson at osagesoftware.com
Thu Oct 15 06:00:57 CEST 2009
Hi Matt,
Good to hear from you. It's been a while ...
On Wed, 14 Oct 2009 15:36:21 -0400
Matt Garretson wrote:
> Greetings, all. Over the years, I've noticed that bogofilter
> sometimes seems to mis-parse messages with MIME attachments.
> Usually, it correctly skips over non-text or non-html
> attachments, but sometimes it ends up tokenizing the encoded
> strings of binary attachments. This usually leads to a score
> of .5 due to dozens/hundreds/thousands of brand-new tokens.
...[snip]...
> At first glance, the boundary string seems odd, though I'm
> not sure if that's the root of the problem.
Agreed. The boundary string is odd: RFC 2046 specifies which characters
a boundary may contain, and "}" is not one of them. Here's the RFC's
specification for non-blank boundary characters:
    bcharsnospace := DIGIT / ALPHA / "'" / "(" / ")" /
                     "+" / "_" / "," / "-" / "." /
                     "/" / ":" / "=" / "?"
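For anyone who wants to check a boundary string by hand, here is a minimal Python sketch of that rule. It is not bogofilter code. The function names and the sample boundaries are made up for illustration (the boundary from the actual message is not reproduced here), and the length and trailing-space checks come from the RFC's full rule, boundary := 0*69<bchars> bcharsnospace, where bchars is bcharsnospace plus the space character.

```python
import string

# bcharsnospace as listed in RFC 2046, section 5.1.1
BCHARS_NOSPACE = frozenset(string.ascii_letters + string.digits + "'()+_,-./:=?")


def invalid_boundary_chars(boundary):
    """Return the characters in `boundary` that the RFC does not allow.

    Hypothetical helper, named for this example only.
    """
    allowed = BCHARS_NOSPACE | {" "}  # bchars := bcharsnospace / " "
    return [ch for ch in boundary if ch not in allowed]


def is_valid_boundary(boundary):
    """Check a boundary against: boundary := 0*69<bchars> bcharsnospace."""
    if not 1 <= len(boundary) <= 70:
        return False
    if boundary[-1] == " ":  # the last character must not be a space
        return False
    return not invalid_boundary_chars(boundary)


if __name__ == "__main__":
    # Both sample boundaries are invented for this example.
    print(is_valid_boundary("simple-boundary_1"))   # True
    print(invalid_boundary_chars("bad}boundary"))   # ['}']
```

A check like this only reports that a boundary is out of spec. It says nothing about how a parser should treat such a message once it has arrived.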
> My bogolexer output showing the errant tokens is here:
>
> http://pastebin.com/m3fb9a0bd
'Tis good that you've included the bogolexer output. I prefer a
somewhat different option set, i.e. "-xM -xL -vvv -p -q" which
suppresses the "get_token: 1" on each line, but shows additional
(admittedly cryptic) info about what's happening.
> Any thoughts? My Bogofilter version is 1.2.1 built from source
> on Fedora 11.
Bogofilter appears to be overlooking the "audio/wav" content type,
even though src/mime.c maps the "audio/" prefix to MIME_AUDIO, which
means "don't decode". I'll have to dig in and see what's happening.
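To make the expected behavior concrete, here is a rough Python sketch of a prefix lookup of that kind. It is an illustration only and not the code in src/mime.c, which is C. Only the "audio/" to MIME_AUDIO mapping and its "don't decode" meaning come from the description above; the "text/" entry, the MIME_UNKNOWN fallback, and the function names are assumptions.

```python
from enum import Enum


class MimeType(Enum):
    MIME_TEXT = "text"        # assumed entry, for contrast
    MIME_AUDIO = "audio"      # from the description above
    MIME_UNKNOWN = "unknown"  # assumed fallback


# Hypothetical prefix table; only the "audio/" row is taken from the message.
PREFIX_TABLE = [
    ("text/", MimeType.MIME_TEXT),
    ("audio/", MimeType.MIME_AUDIO),
]

# Types whose bodies should be skipped rather than decoded and tokenized.
NO_DECODE = {MimeType.MIME_AUDIO}


def classify(content_type):
    """Map a Content-Type header value to a MimeType by its prefix."""
    # Drop parameters such as '; name="..."' and compare case-insensitively.
    value = content_type.split(";", 1)[0].strip().lower()
    for prefix, kind in PREFIX_TABLE:
        if value.startswith(prefix):
            return kind
    return MimeType.MIME_UNKNOWN


def should_skip(content_type):
    """True when the part's body should not be decoded."""
    return classify(content_type) in NO_DECODE


if __name__ == "__main__":
    print(should_skip('audio/wav; name="voicemail.wav"'))  # True
    print(should_skip("text/plain"))                       # False
```

With a lookup like this, an "audio/wav" part would be skipped, so seeing its encoded body tokenized suggests the part's type is never reaching the lookup, or the part is not being recognized as a separate part at all.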
Regards,
David