Parsing of certain MIME messages, e.g. Vonage

Thu Oct 15 06:30:24 CEST 2009

On Thu, 15 Oct 2009 00:00:57 -0400
David Relson wrote:

> Hi Matt,
> 
> Good to hear from you.  It's been a while ...
> 
> On Wed, 14 Oct 2009 15:36:21 -0400
> Matt Garretson wrote:
> 
> > Greetings, all. Over the years, I've noticed that bogofilter
> > sometimes seems to mis-parse messages with MIME attachments.
> > Usually, it correctly skips over non-text or non-html 
> > attachments, but sometimes it ends up tokenizing the encoded 
> > strings of binary attachments. This usually leads to a score
> > of .5 due to dozens/hundreds/thousands of brand-new tokens.
> 
> ...[snip]...
> 
> > At first glance, the boundary string seems odd, though I'm
> > not sure if that's the root of the problem. 
> 
> Agreed. The boundary string is odd: RFC 2046 specifies which
> characters are legal in a boundary, and "}" is not one of them.
> Here's the RFC's specification for non-blank boundary characters:
> 
>      bcharsnospace := DIGIT / ALPHA / "'" / "(" / ")" /
>                       "+" / "_" / "," / "-" / "." /
>                       "/" / ":" / "=" / "?"
> 
> > My bogolexer output showing the errant tokens is here:
> > 
> >   http://pastebin.com/m3fb9a0bd
> 
> 'Tis good that you've included the bogolexer output.  I prefer a
> somewhat different option set, i.e. "-xM -xL -vvv -p -q" which
> suppresses the "get_token: 1" on each line, but shows additional
> (admittedly cryptic) info about what's happening.
> 
> > Any thoughts? My Bogofilter version is 1.2.1 built from source 
> > on Fedora 11.
> 
> Bogofilter appears to be overlooking the "audio/wav" specification,
> even though src/mime.c maps "audio/" to MIME_AUDIO, which means "don't
> decode".  I'll have to dig in and see what's happening.
> 
> Regards,
> 
> David

Bogolexer definitely doesn't appreciate the curly brace.

Try the following command:

   bogolexer -p -q < message | grep ^mime:

once with the original message and again with the curly brace removed
from the boundary.  Without the brace, the token count drops from 113 to
67 and looks more reasonable.
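
To check a boundary string by hand against the RFC 2046 grammar quoted
above, something like the following rough Python sketch may help.  The
function name boundary_problems and the sample boundary are made up for
illustration; the sample is not the boundary from Matt's message.

```python
import string

# RFC 2046 bcharsnospace: DIGIT / ALPHA / ' ( ) + _ , - . / : = ?
BCHARSNOSPACE = set(string.ascii_letters + string.digits + "'()+_,-./:=?")


def boundary_problems(boundary):
    """Return a list of RFC 2046 violations found in a boundary string.

    Hypothetical helper for illustration; it is not part of bogofilter.
    """
    problems = []
    # The grammar allows 0 to 69 bchars followed by one bcharsnospace.
    if not 1 <= len(boundary) <= 70:
        problems.append("length %d is outside 1..70" % len(boundary))
    # A space is a legal bchar, just not as the final character.
    bad = sorted(set(boundary) - BCHARSNOSPACE - {" "})
    if bad:
        problems.append("illegal characters: " + " ".join(repr(c) for c in bad))
    if boundary.endswith(" "):
        problems.append("ends with a space")
    return problems


if __name__ == "__main__":
    # Hypothetical boundary containing curly braces.
    print(boundary_problems("----=_Part_{0001}"))
```

An empty list means the boundary conforms; a boundary with curly braces
is reported as having illegal characters.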

Alternatively, if you build from source, try the following patch:

--- beginning of patch ---
Index: src/lexer_v3.l
===================================================================
--- src/lexer_v3.l	(revision 6867)
+++ src/lexer_v3.l	(working copy)
@@ -139,8 +139,8 @@
 
 UINT8		([01]?[0-9]?[0-9]|2([0-4][0-9]|5[0-5]))
 IPADDR		{UINT8}\.{UINT8}\.{UINT8}\.{UINT8}
-BCHARSNOSPC	[[:alnum:]()+_,-./:=?#\']
-BCHARS		[[:alnum:]()+_,-./:=?#\' ]
+BCHARSNOSPC	[[:alnum:]()+_,-./:=?#{}\']
+BCHARS		[[:alnum:]()+_,-./:=?#{}\' ]
 MIME_BOUNDARY	{BCHARS}*{BCHARSNOSPC}
 
 ID		<?[[:alnum:]\-\.]+>?
--- end of patch ---
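
For anyone who wants to see what the character class change does without
rebuilding, here is a small Python approximation of the old and patched
MIME_BOUNDARY patterns.  It is only a sketch: Python's re module is not
flex, [[:alnum:]] is assumed to mean ASCII letters and digits, the helper
names are invented, and the sample boundary is hypothetical.  It shows
only how much of a string each pattern accepts, not how bogofilter uses
the match internally.

```python
import re

# Approximate equivalents of BCHARSNOSPC before and after the patch.
# In flex, ",-." inside a class is a range covering ",", "-" and ".";
# the hyphen is written out explicitly here.
OLD_NOSPC = r"[A-Za-z0-9()+_,\-./:=?#']"
NEW_NOSPC = r"[A-Za-z0-9()+_,\-./:=?#{}']"


def boundary_regex(nospc):
    """Build {BCHARS}*{BCHARSNOSPC}, where BCHARS is the class plus a space."""
    bchars = nospc[:-1] + " ]"
    return re.compile("(?:%s)*%s" % (bchars, nospc))


def matched_prefix(nospc, text):
    """Return the leading part of text that the boundary pattern accepts."""
    m = boundary_regex(nospc).match(text)
    return m.group(0) if m else ""


if __name__ == "__main__":
    sample = "----=_Part_{0001}"  # hypothetical boundary with braces
    print("old:", matched_prefix(OLD_NOSPC, sample))
    print("new:", matched_prefix(NEW_NOSPC, sample))
```

With the old class the match stops just before the "{"; with the patched
class the whole sample string is accepted.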


Regards,

David