ignore text/plain part of multipart/alternative messages?

Matthias Andree matthias.andree at gmx.de
Tue Aug 12 23:46:44 CEST 2003


David Flanagan <david at davidflanagan.com> writes:

> The biggest category of spam that's been getting through to me is
> multipart/alternative messages that contain text apparently excerpted
> from books in the text/plain part, and whatever the spammer's payload is
> in the text/html part.
>
> In Paul Graham's latest article, he asserts that this type of spam isn't
> a big deal because the text/plain camouflage doesn't actually look like
> real e-mail.

Either that, or the tokens are merely "Unsure" (neutral) and don't have
much influence on the actual score.
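
To illustrate why neutral tokens barely matter, here is a minimal sketch
using the combining formula from Paul Graham's "A Plan for Spam". This is
an assumption for illustration only: your bogofilter may be configured
with a different algorithm, and the token probabilities below are made up.

```python
from math import prod

def combine(probs):
    """Graham-style naive Bayes combination of per-token spam probabilities."""
    spam = prod(probs)
    ham = prod(1.0 - p for p in probs)
    return spam / (spam + ham)

# Hypothetical per-token probabilities, not taken from any real wordlist.
strong = [0.99, 0.95, 0.90]   # spammy tokens, e.g. from the HTML payload
neutral = [0.5] * 20          # "Unsure" tokens, e.g. from a book excerpt
hammy = [0.2] * 5             # tokens that look like the user's real mail

print(combine(strong))
print(combine(strong + neutral))  # a 0.5 token scales both products equally
print(combine(strong + hammy))    # hammy tokens do pull the score down
```

Under this formula a token at exactly 0.5 cancels out, so camouflage only
helps the spammer if its tokens score as ham in your own wordlist.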

> I'm not sure I agree: the ones that are getting through to me seem to
> be excerpts from political memoirs or something about the Reagan/Bush
> years.  Since I get a lot of legitimate e-mail griping about the
> current Bush administration, this spam gets through to me.

Try running bogofilter -vvv to see which tokens matter and which are
ignored by bogofilter. Which bogofilter version and algorithm are you
using?

> In any case, I think there is a (simple?) solution.  For
> multipart/alternative messages, I think that only the default part
> should be tokenized.

I don't think so, for three reasons:
a. The "default" part depends on the mail user agent. Granted, M$ crap
   will display the HTML, so that is what spammers go for.
b. Bayes' theorem assumes you drop _all_ the available information into
   the weighing scale, so skipping a part throws evidence away.
c. The HTML part is still there.
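
As a rough illustration of point a, the sketch below uses Python's
standard email package (not bogofilter's own parser) on a made-up message.
The helper names parts_by_type and default_part are hypothetical. It shows
that both alternatives are present in the message, and that which one
counts as the "default" depends on the reader's preference order.

```python
from email import message_from_string
from email.policy import default

# Made-up sample: book-like text/plain part, payload in the text/html part.
RAW = """\
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="XYZ"

--XYZ
Content-Type: text/plain; charset="us-ascii"

The senator recalled the early years of the administration.
--XYZ
Content-Type: text/html; charset="us-ascii"

<html><body><a href="http://example.invalid/">Buy now</a></body></html>
--XYZ--
"""

def parts_by_type(raw):
    """Map each leaf part's content type to its decoded text."""
    msg = message_from_string(raw, policy=default)
    return {part.get_content_type(): part.get_content()
            for part in msg.walk() if not part.is_multipart()}

def default_part(raw, prefer):
    """Content type a reader with the given preference order would pick."""
    msg = message_from_string(raw, policy=default)
    return msg.get_body(preferencelist=prefer).get_content_type()

print(sorted(parts_by_type(RAW)))
print(default_part(RAW, ("html", "plain")))
print(default_part(RAW, ("plain", "html")))
```

A filter that tokenized only one "default" part would have to guess that
preference order for every recipient's mail reader.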

Would you care to send me one of those "text/plain is book excerpt,
text/html is UCE" mails so I can have a look? If so, please save the
whole mail to a file ("export") and zip it before you attach it, so it
doesn't get filtered out here. You can omit non-MIME headers if you want
to protect your privacy; all I need are the MIME-Version: and Content-*:
headers.
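
If you'd rather not strip the headers by hand, something along these
lines should do it. This is only a sketch using Python's standard email
package; the function name keep_mime_headers and the sample message are
made up for illustration. It removes top-level headers other than
MIME-Version: and Content-*: and leaves the parts themselves untouched.

```python
from email import message_from_string

# Made-up sample message standing in for the exported spam.
SAMPLE = """\
Received: from mail.example.invalid by mx.example.invalid
From: someone@example.invalid
Subject: hello
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="XYZ"

--XYZ
Content-Type: text/plain

book excerpt here
--XYZ
Content-Type: text/html

<html><body>payload here</body></html>
--XYZ--
"""

def keep_mime_headers(raw):
    """Drop every top-level header except MIME-Version and Content-*."""
    msg = message_from_string(raw)
    for name in set(msg.keys()):
        lower = name.lower()
        if lower != "mime-version" and not lower.startswith("content-"):
            del msg[name]  # removes all occurrences of that header
    return msg.as_string()

print(keep_mime_headers(SAMPLE))
```

Write the result to a file and zip that file before attaching it.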

-- 
Matthias Andree



