Parsing of certain MIME messages, e.g. Vonage

David Relson relson at osagesoftware.com
Thu Oct 15 06:00:57 CEST 2009


Hi Matt,

Good to hear from you.  It's been a while ...

On Wed, 14 Oct 2009 15:36:21 -0400
Matt Garretson wrote:

> Greetings, all. Over the years, I've noticed that bogofilter
> sometimes seems to mis-parse messages with MIME attachments.
> Usually, it correctly skips over non-text or non-html 
> attachments, but sometimes it ends up tokenizing the encoded 
> strings of binary attachments. This usually leads to a score
> of .5 due to dozens/hundreds/thousands of brand-new tokens.

...[snip]...

> At first glance, the boundary string seems odd, though I'm
> not sure if that's the root of the problem. 

Agreed. The boundary string is odd: RFC 2046 specifies which characters
are allowed in a boundary, and "}" is not among them.  Here's the RFC's
specification for non-blank boundary characters:

     bcharsnospace := DIGIT / ALPHA / "'" / "(" / ")" /
                      "+" / "_" / "," / "-" / "." /
                      "/" / ":" / "=" / "?"
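
As a rough illustration of that rule (this is not bogofilter's code), the
Python sketch below checks a boundary string against the grammar above.
The function names and the sample boundary are made up for the example;
the 70-character limit and the "may contain spaces, but not at the end"
rule come from the same section of RFC 2046.

```python
import string

# bcharsnospace from RFC 2046, section 5.1.1 (the grammar quoted above).
BCHARS_NOSPACE = frozenset(string.ascii_letters + string.digits + "'()+_,-./:=?")
# bchars additionally allows a space, though not as the final character.
BCHARS = BCHARS_NOSPACE | {" "}


def invalid_boundary_chars(boundary):
    """Return the disallowed characters in `boundary`, in order of first appearance."""
    seen = []
    for ch in boundary:
        if ch not in BCHARS and ch not in seen:
            seen.append(ch)
    return seen


def is_valid_boundary(boundary):
    """True if `boundary` is 1 to 70 allowed characters and does not end in a space."""
    return (
        1 <= len(boundary) <= 70
        and not invalid_boundary_chars(boundary)
        and boundary[-1] in BCHARS_NOSPACE
    )


if __name__ == "__main__":
    # Hypothetical boundary, invented for this example (not the one from the
    # message under discussion).
    sample = "----=_Part_{1234}"
    print(invalid_boundary_chars(sample))  # ['{', '}']
    print(is_valid_boundary(sample))       # False
```

A strict check like this only says whether the boundary conforms; whether a
parser should still accept a non-conforming boundary is a separate question.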

> My bogolexer output showing the errant tokens is here:
> 
>   http://pastebin.com/m3fb9a0bd

'Tis good that you've included the bogolexer output.  I prefer a
somewhat different option set, i.e. "-xM -xL -vvv -p -q" which
suppresses the "get_token: 1" on each line, but shows additional
(admittedly cryptic) info about what's happening.

> Any thoughts? My Bogofilter version is 1.2.1 built from source 
> on Fedora 11.

Bogofilter appears to be overlooking the "audio/wav" specification,
even though src/mime.c maps "audio/" to MIME_AUDIO, which means "don't
decode".  I'll have to dig in and see what's happening.

Regards,

David


