ignore text/plain part of multipart/alternative messages?

David Flanagan david at davidflanagan.com
Tue Aug 12 19:13:32 CEST 2003


> >The biggest category of spam that's been getting through to me is
> >multipart/alternative messages that contain text apparently excerpted
> >from books in the text/plain part, and whatever the spammer's payload is
> >in the text/html part.
>
> Is that true?

It's true for me.

> Until yesterday (!) I had my
> http://piology.org/.procmailrc.html (web page not yet
> adjusted) shrink multipart/alternative to the first
> text/plain part (if exists), before bogofilter checked the
> message. I have never seen what you describe in false

Hmm.  I didn't know that one could do this.  The problem I see with this
approach is that when the text/plain and text/html portions differ, the
spam payload is always going to be in the text/html part. Like you, I
read my mail with a plain text reader, too, so I'm not interested in the
text/html portion.  But that is where the spam is, and I'd want to keep
that part around to enhance bogofilter's effectiveness.  If I were
willing to adopt a text/html reader, then stripping the text/plain part
in procmail would be a useful enhancement.

> >In Paul Graham's latest article, he asserts that this type of spam isn't
> >a big deal because the text/plain camouflage doesn't actually look like
> >real e-mail.  
> 
> If this method is used -- again, I haven't seen it -- my
> experience agrees with Paul's.

Maybe it is just my bad luck that the spammers choose random texts that
happen to be similar to my ham...  In any case, even if these random
texts don't result in false negatives, they still have the potential to
dilute the effectiveness of the database.

> >In any case, I think there is a (simple?) solution.  For
> >multipart/alternative messages, I think that only the default part
> >should be tokenized.  I'm sure that 99% of the mail-readers out there
> >display the text/html part of these messages, 
> 
> Do you? Mine does not (agreed only few people use it), but
> there are more like this.

Maybe 99% is an overstatement, but it is still the vast majority (think
of Windows users with Outlook Express).  Implicit in
multipart/alternative messages is the assumption that the "richest"
message format that can be displayed will be used.  The text/html
portion is the default one.  The text/plain portion is the fallback for
geeks like us.
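
To make the "richest format wins" rule concrete, here is a rough sketch
in Python using the standard email package.  It relies on the MIME
convention (RFC 2046) that alternatives are listed in increasing order
of preference, so a reader shows the last part it can display.  The
sample message and the preferred_part() helper are made up for
illustration; they are not part of bogofilter or of any mail reader.

```python
from email import message_from_string

# Hypothetical sample message: plain fallback first, HTML last.
RAW = """\
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="b"

--b
Content-Type: text/plain

plain fallback
--b
Content-Type: text/html

<p>rich version</p>
--b--
"""


def preferred_part(msg, displayable):
    """Return the last alternative whose type the reader can display."""
    if msg.get_content_type() != "multipart/alternative":
        return None
    chosen = None
    for part in msg.get_payload():
        if part.get_content_type() in displayable:
            chosen = part  # later alternatives override earlier ones
    return chosen


msg = message_from_string(RAW)
html_reader = preferred_part(msg, {"text/plain", "text/html"})
text_reader = preferred_part(msg, {"text/plain"})
print(html_reader.get_content_type())  # text/html
print(text_reader.get_content_type())  # text/plain
```

An HTML-capable reader ends up on the text/html part and a text-only
reader on the text/plain part, which is the asymmetry the spammers are
exploiting.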

I see four possible cases.  A multipart/alternative message with
text/plain and text/html parts can either be:

1) Correctly formatted with the same message in both parts.  In this
   case tokenizing both parts double-counts the tokens.  (Whether they
   are ham or spam.)  I believe that multipart/alternative requires the
   alternative message part to contain the same message, to the extent
   allowed by the different message types.  In theory, then, it is
   always safe to look at only one part of the message and ignore the
   rest.  So my proposal to ignore the text/plain portion of these
   messages would solve the double-counting problem.

2) Incorrectly formatted, with a spam payload in the text/html portion
   (where most recipients will see it) and random text (to dilute our
   statistics) in the text/plain portion.  For me, two or three of these
   get through bogofilter each day.  In my proposal, bogofilter would
   look at the spam and ignore the camouflage.

3) Incorrectly formatted with different spam in the two parts.  Perhaps
   a spammer figures that people reading with text/plain are real studs
   with lots of money and need new stock tips more than they need
   enlargement.  I've never seen this, and I doubt it would occur.  But
   if it does, bogofilter would still have the html spam to base its
   spamicity decision on.  Not much harm comes from ignoring the plain
   text spam.

4) Incorrectly formatted with the spam payload in text/plain and random
   words in text/html. Then only us old-fashioned text/plain folks would
   see the spam, and most recipients will see random words.  No spammer
   is going to do this.  Since this case is unlikely to occur, there is
   no harm from ignoring the text/plain part.
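
Cases 1 and 2 can be illustrated with a small Python sketch.  This is
not bogofilter's code: the whitespace tokenizer and the crude tag
stripping are stand-ins for its real lexer, and TEMPLATE, leaf_parts()
and count_tokens() are hypothetical names chosen for the example.  The
only idea taken from the proposal is that the text/plain part of a
multipart/alternative message is skipped when a text/html sibling
exists.

```python
import re
from collections import Counter
from email import message_from_string

# Hypothetical two-part message template used only as test data.
TEMPLATE = """\
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="b"

--b
Content-Type: text/plain

{plain}
--b
Content-Type: text/html

{html}
--b--
"""


def leaf_parts(msg, skip_plain_fallback):
    """Yield the leaf parts that would be handed to the tokenizer."""
    if not msg.is_multipart():
        yield msg
        return
    children = msg.get_payload()
    if (skip_plain_fallback
            and msg.get_content_type() == "multipart/alternative"
            and any(c.get_content_type() == "text/html" for c in children)):
        # The proposal: ignore text/plain when an HTML alternative exists.
        children = [c for c in children
                    if c.get_content_type() != "text/plain"]
    for child in children:
        yield from leaf_parts(child, skip_plain_fallback)


def count_tokens(msg, skip_plain_fallback):
    """Count tokens using a naive stand-in for a real spam-filter lexer."""
    counts = Counter()
    for part in leaf_parts(msg, skip_plain_fallback):
        text = re.sub(r"<[^>]+>", " ", part.get_payload())  # crude tag strip
        counts.update(text.lower().split())
    return counts


# Case 1: same message in both parts, so every token is counted twice.
same = message_from_string(
    TEMPLATE.format(plain="meeting at noon", html="<p>meeting at noon</p>"))
print(count_tokens(same, False)["meeting"])  # 2
print(count_tokens(same, True)["meeting"])   # 1

# Case 2: book excerpt as camouflage in text/plain, payload in text/html.
camo = message_from_string(
    TEMPLATE.format(plain="call me ishmael", html="<b>cheap pills</b>"))
print("ishmael" in count_tokens(camo, False))  # True
print("ishmael" in count_tokens(camo, True))   # False
```

A message that is plain text only is not multipart, so it passes through
leaf_parts() untouched; only the fallback inside an alternative is
dropped.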

   David Flanagan



