ignore text/plain part of multipart/alternative messages?

Tom Anderson tanderso at oac-design.com
Mon Sep 1 09:54:47 CEST 2003


> In any case, I think there is a (simple?) solution.  For
> multipart/alternative messages, I think that only the default part
> should be tokenized.  I'm sure that 99% of the mail-readers out there
> display the text/html part of these messages, and no spammer is going to
> send spam in the text/plain part.  So don't even bother tokenizing that
> part: just skip to the payload.

In a PGP-signed message, the "default" MIME part is "text/plain" while
the sig is "application/pgp-signature".  However, the content-type of
the MIME message in this case is "multipart/signed".  The point, though,
is that you can't just skip all "text/plain" portions.

In a message where there is both a "text/plain" part and a "text/html"
(or "text/richtext" or "text/enriched") part, the MIME content-type of
the message is "multipart/mixed" or "multipart/alternative" or sometimes
even "multipart/report".  It is generally "multipart/alternative", but
not always.  A user setting in the mail client is what ultimately
decides which "alternative" is displayed by default.  Generally, neither
part can be counted on to carry the "sales pitch" or the "red herring",
as different users with different mail clients may see only the plain
text or only the HTML.  A spammer really can't know.  Both parts should
be the same message in different formats.  Therefore, the "text/plain"
portion is just as relevant to the spam classification as the HTML
part.  For example, in another spam, the plain text may be the payload
while the HTML portion is the red herring.
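
To make that concrete, here is a minimal sketch of handing every text
part to the tokenizer instead of skipping "text/plain".  It uses Python's
standard "email" package purely for illustration; it is not how
bogofilter itself is implemented, and the sample message and the
text_parts() helper are made up for this example.

```python
from email import message_from_string

# Hypothetical two-part message, invented for this illustration.
RAW = """\
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="b1"

--b1
Content-Type: text/plain; charset="us-ascii"

Hello in plain text.
--b1
Content-Type: text/html; charset="us-ascii"

<p>Hello in HTML.</p>
--b1--
"""


def text_parts(msg):
    """Return (content_type, text) for every text/* part, skipping none."""
    found = []
    for part in msg.walk():
        # walk() also yields the multipart container; its main type is
        # "multipart", so it is filtered out here.
        if part.get_content_maintype() != "text":
            continue
        payload = part.get_payload(decode=True)
        charset = part.get_content_charset() or "us-ascii"
        found.append((part.get_content_type(),
                      payload.decode(charset, errors="replace")))
    return found


parts = text_parts(message_from_string(RAW))
for content_type, text in parts:
    print(content_type, "->", text.strip())
```

Both alternatives come back, so whichever one a given mail client
happens to display has been seen by the filter.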

In a "multipart/mixed" message, parts can be tagged with a
"Content-Disposition" header whose value is either "inline" or
"attachment".  If you are receiving emails with "text/plain"
attachments, then they likely have a "Content-Disposition: attachment"
line in the header of that part of the MIME message.  Check to see
if this is the case.  I would argue that perhaps anything tagged as an
attachment should be ignored by the filter (or passed to a virus
checker).  Any parts with no "Content-Disposition" header, or those
with "Content-Disposition: inline", should remain a subject of inquiry
for the filter.
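
As a rough sketch of that rule, again in Python's standard "email"
package and again only as an illustration (the sample message and the
split_by_disposition() name are my own assumptions, not anything in
bogofilter):

```python
from email import message_from_string

# Hypothetical message: one untagged part, one inline, one attachment.
RAW = """\
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="b1"

--b1
Content-Type: text/plain

Body text shown to the reader.
--b1
Content-Type: text/plain
Content-Disposition: inline

Inline note.
--b1
Content-Type: text/plain; name="notes.txt"
Content-Disposition: attachment; filename="notes.txt"

Attached file contents.
--b1--
"""


def split_by_disposition(msg):
    """Split leaf parts into (parts to filter, parts to skip)."""
    to_filter, to_skip = [], []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no text of their own
        # get_content_disposition() returns "inline", "attachment",
        # or None when the header is absent.
        if part.get_content_disposition() == "attachment":
            to_skip.append(part)    # ignore, or hand to a virus checker
        else:
            to_filter.append(part)  # inline or untagged: classify it
    return to_filter, to_skip


to_filter, to_skip = split_by_disposition(message_from_string(RAW))
print(len(to_filter), "part(s) to classify;", len(to_skip), "skipped")
```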

I would also be in favor of ranking each of the MIME parts individually
and then deciding on the whole package based on the most "interesting"
part.  However, a sufficiently innocuous "red herring" package may be as
non-spammish as the payload is spammish, or the other way around in the
case of an acquaintance forwarding you a spam.  How to decide then?  In
that case, I think it would come down to the headers.  If the MIME parts
come in at a tie, the headers should be the deciding vote. 
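
One way this could look, as a sketch only: assume each MIME part and the
headers have already been given a spam probability between 0 and 1 by
some scorer.  The classify() function, the 0.5 neutral point, and the
tie_margin value are all assumptions of mine for illustration, not
bogofilter behaviour.

```python
def classify(part_scores, header_score, tie_margin=0.05):
    """Decide on the most "interesting" part; headers break ties.

    part_scores:  spam probability per MIME part (hypothetical inputs).
    header_score: spam probability of the headers alone.
    tie_margin:   how close the two pulls must be to count as a tie.
    """
    # How far the spammiest part sits above neutral, and how far the
    # most innocent part sits below it.
    spam_pull = max((s - 0.5 for s in part_scores if s > 0.5), default=0.0)
    ham_pull = max((0.5 - s for s in part_scores if s < 0.5), default=0.0)

    if abs(spam_pull - ham_pull) <= tie_margin:
        # Payload and red herring cancel out: headers cast the vote.
        deciding = header_score
    elif spam_pull > ham_pull:
        deciding = 0.5 + spam_pull
    else:
        deciding = 0.5 - ham_pull
    return "spam" if deciding > 0.5 else "ham"


# A spammy payload with an equally innocent red herring: headers decide.
print(classify([0.99, 0.02], header_score=0.9))
# The same parts forwarded by an acquaintance with clean headers.
print(classify([0.99, 0.02], header_score=0.1))
```

With a clear winner among the parts, the headers are not consulted at
all; they only matter when the parts pull about equally in opposite
directions.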

Sincerely,

Tom Anderson
Order amid Chaos, Inc.
http://oac-design.com





More information about the Bogofilter mailing list