PATCH to ignore uuencoded attachments

Sat Nov 27 02:03:11 CET 2004

On Sat, 27 Nov 2004 01:19:18 +0100
Matthias Andree wrote:

> David Relson <relson at osagesoftware.com> writes:
> 
> > I've attached a patch for lexer_v3.l that I think will help you.  
> >
> > Checking my archives (approx 350,000 messages), I found 6 messages
> > with "begin 666" in them, and 3 of those were from Oct 2003 when you
> > posted this very same problem.  The lack of test cases means I can't
> > thoroughly test the patch.  Please give it a good workout and let me
> > know if the patch works for you.
> 
> I'm not comfortable with the patch.
> 
> Mails that try to hide content from Outlook Express are in the wild,
> and we mustn't let ourselves be fooled by them.
> 
> > +<TEXT>begin\ 666\ .*  				{ BEGIN UUENCODED; }
> > +<UUENCODED>end$					{ BEGIN TEXT; }
> 
> These must be anchored to the beginning of the line with "^" (is also
> more efficient).

You're absolutely correct, the pattern needs to be anchored at both
ends, perhaps:

<TEXT>^begin\ 666\ [^ ]*$

That allows exactly two spaces on the line (none in the filename) ...

> A nul-length line, that is one matching the regexp   ^[` ]$   also
> ends the uuencoded part.

<UUENCODED>\n$
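
For comparison, the end conditions under discussion (an "end" line, a
zero-length line holding a single backquote or blank, and an empty
line) could be written as a plain C predicate. This is only a sketch:
uu_body_ends() is a hypothetical name, not an existing bogofilter
function, and treating an empty line as a terminator is an assumption
taken from the \n pattern above.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: true if `line` would end a uuencoded body. */
bool uu_body_ends(const char *line)
{
    /* Length of the line without any trailing CR/LF. */
    size_t len = strcspn(line, "\r\n");

    /* Explicit "end" line, anchored at both ends. */
    if (len == 3 && strncmp(line, "end", 3) == 0)
        return true;

    /* Zero-length data line: a lone backquote or blank, i.e. ^[` ]$ */
    if (len == 1 && (line[0] == '`' || line[0] == ' '))
        return true;

    /* Assumption: a completely empty line also ends the body. */
    return len == 0;
}
```

Called once per line while in the uuencoded state, a true result would
correspond to switching back to TEXT.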

> Microsoft also appears to accept uuencode if the mode is the empty
> string, i. e. "begin  filename" (with two blanks).
> 
> > +<UUENCODED>{TOKEN}				/* ignore tokens */
> > +<UUENCODED>\${NUM}(\.{NUM})?			/* ignore money */

Evgeny's test case included token '$1' within the uuencoded text. 
As '$1' isn't matched by {TOKEN}, it wasn't being discarded. 
So _something_ extra is needed to discard it, and the money
pattern fit the bill.  Odd, eh?

> *shrug* These shouldn't happen, and $ is a valid character in
> uuencoded mode.
> 
> Actually, we would want to check that we're actually seeing uuencoded
> lines. That would be some custom function that calls BEGIN TEXT; if
> the line is corrupt, to defang the Outlook confusion
> mails. ("I-love-you-signature", a fake virus to scare Outlook Express
> users)

Actually checking is tricky as it involves reading additional lines of
text :-<
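
A per-line sanity check is at least cheap to sketch. The C function
below is a rough illustration, not tested against bogofilter:
looks_uuencoded() is a hypothetical name, and it assumes the classic
uuencode layout (a length character, then four encoded characters for
every three bytes, all in the range blank through backquote). Encoders
that strip trailing blanks or pad lines would fail this strict check.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: true if `line` has the shape of one uuencoded
 * data line.  Assumes classic uuencode, no stripped trailing blanks. */
bool looks_uuencoded(const char *line)
{
    /* Length of the line without any trailing CR/LF. */
    size_t len = strcspn(line, "\r\n");
    size_t i, nbytes, expected;

    if (len == 0)
        return false;

    /* Every character must lie between ' ' (0x20) and '`' (0x60). */
    for (i = 0; i < len; i++) {
        unsigned char c = (unsigned char)line[i];
        if (c < 0x20 || c > 0x60)
            return false;
    }

    /* First character gives the decoded byte count (0..63);
     * both ' ' and '`' decode to zero. */
    nbytes = (size_t)(((unsigned char)line[0] - 0x20) & 0x3f);

    /* Three bytes are encoded as four characters. */
    expected = 4 * ((nbytes + 2) / 3);

    return len - 1 == expected;
}
```

For example, "#0V%T" (the encoding of "Cat") passes, while "this is a
test" fails on its first lowercase letter. A lexer action could leave
the uuencoded state as soon as a line fails, which needs no lookahead
beyond the current line but also cannot tell in advance whether real
uuencoded data follows the "begin" line.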

Here's modified lexer code that deals with the above comments (but
doesn't ensure that uuencoded text is really present):

<TEXT>^begin\ 666\ [^ ]*$			{ BEGIN UUENCODED; }
<UUENCODED>end$					{ BEGIN TEXT; }
<UUENCODED>\n$					{ BEGIN TEXT; }
<UUENCODED>{TOKEN}				/* ignore tokens */
<UUENCODED>\${NUM}(\.{NUM})?			/* ignore money */

With the above code (and the message below), command "bogolexer -p <
msg.begin.666.txt" finds 27 tokens, which seems about right.

### begin msg.begin.666.txt ###
Message-ID: <006a01c401f8$1d7d88e0$6b02a8c0@blabla.ru>
From: "AAA" <aaa@blabla.ru>
To: <bbb@blabla.msk.su>
Subject: some text
Date: Thu, 4 Mar 2004 17:51:02 +0300
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 6.00.2600.0000
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2600.0000

some text in koi8-r coding

begin 666 LK2540-7R.pdf
M)5!$1BTQ+C0-)>+CS],-"C$@,"!O8FH-/#P@#2]4>7!E("]086=E( TO4&%R
M96YT(#$U(# @4B -+U)E<V]U<F-E<R R(# @4B -+T-O;G1E;G1S(#,@,"!2
M( TO365D:6%";W@@6R P(# @-#<S(#8V.2!=( TO0W)O<$)O>"!;(# @," T
end

this is a test

### end msg.begin.666.txt ###