PATCH to ignore uuencoded attachments
David Relson
relson at osagesoftware.com
Sat Nov 27 02:03:11 CET 2004
On Sat, 27 Nov 2004 01:19:18 +0100
Matthias Andree wrote:
> David Relson <relson at osagesoftware.com> writes:
>
> > I've attached a patch for lexer_v3.l that I think will help you.
> >
> > Checking my archives (approx 350,000 messages), I found 6 messages
> > with "begin 666" in them, and 3 of those were from Oct 2003 when you
> > posted this very same problem. The lack of test cases means I can't
> > thoroughly test the patch. Please give it a good workout and let me
> > know if the patch works for you.
>
> I'm not comfortable with the patch.
>
> Mails that try to hide content from Outlook Express are in the wild,
> and we mustn't let ourselves be fooled by them.
>
> > +<TEXT>begin\ 666\ .* { BEGIN UUENCODED; }
> > +<UUENCODED>end$ { BEGIN TEXT; }
>
> These must be anchored to the beginning of the line with "^" (is also
> more efficient).
You're absolutely correct. The pattern needs to be anchored at both
ends, perhaps:
<TEXT>^begin\ 666\ [^ ]*$
so that the line contains exactly two spaces ...
> A nul-length line, that is one matching the regexp ^[` ]$ also
> ends the uuencoded part.
<UUENCODED>\n$
> Microsoft also appears to accept uuencode if the mode is the empty
> string, i.e. "begin  filename" (with two blanks).
>
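For what it's worth, a hand-written check for the begin line could cover
the empty-mode case too. Here is a minimal sketch in C, assuming the line
arrives without its trailing newline. is_uu_begin_line() is a hypothetical
helper, not something in bogofilter, and it accepts any octal mode rather
than only 666:

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper (not bogofilter code): returns true if `line`
 * looks like "begin MODE NAME", where MODE is zero or more octal
 * digits and NAME is non-empty and contains no blanks. */
bool is_uu_begin_line(const char *line)
{
    const char *p = line;

    if (strncmp(p, "begin ", 6) != 0)
        return false;
    p += 6;

    /* mode: octal digits, possibly none ("begin  name", two blanks) */
    while (*p >= '0' && *p <= '7')
        p++;

    /* exactly one blank separates mode and file name */
    if (*p != ' ')
        return false;
    p++;

    /* file name: at least one character, no further blanks */
    if (*p == '\0')
        return false;
    return strchr(p, ' ') == NULL;
}
```

This accepts "begin 666 LK2540-7R.pdf" and "begin  file.pdf", and rejects
ordinary prose such as "begin the day".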
> > +<UUENCODED>{TOKEN} /* ignore tokens */
> > +<UUENCODED>\${NUM}(\.{NUM})? /* ignore money */
Evgeny's test case included token '$1' within the uuencoded text.
As '$1' isn't matched by {TOKEN}, it wasn't being discarded.
So _something_ extra is needed to discard it, and the money
pattern fit the bill. Odd, eh?
> *shrug* These shouldn't happen, and $ is a valid character in
> uuencoded mode.
>
> Actually, we would want to check that we're actually seeing uuencoded
> lines. That would be some custom function that calls BEGIN TEXT; if
> the line is corrupt, to defang the Outlook confusion
> mails. ("I-love-you-signature", a fake virus to scare Outlook Express
> users)
Actually checking is tricky as it involves reading additional lines of
text :-<
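For a single line, though, the check itself is cheap: the first character
encodes the number of decoded bytes, which fixes how long the line has to
be, and every character must fall in the range ' ' through '`'. A rough
sketch in C follows. is_uu_data_line() is a hypothetical name, not existing
bogofilter code; it assumes the trailing newline is already stripped and
that the encoder neither trimmed trailing blanks nor appended a checksum
character:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not bogofilter code): true if `line` is
 * plausible uuencoded data, judged by alphabet and length only. */
bool is_uu_data_line(const char *line)
{
    size_t len = strlen(line);
    size_t i, nbytes, expected;

    if (len == 0)
        return false;

    /* every character must lie in the uuencode alphabet ' ' .. '`' */
    for (i = 0; i < len; i++) {
        unsigned char c = (unsigned char)line[i];
        if (c < 32 || c > 96)
            return false;
    }

    /* first character encodes the number of decoded bytes (0..63) */
    nbytes = (size_t)((line[0] - 32) & 63);

    /* each group of up to 3 bytes is written as 4 characters */
    expected = 1 + 4 * ((nbytes + 2) / 3);

    return len == expected;
}
```

With this, "#0V%T" (the encoding of "Cat") passes, while "this is a test"
fails because lowercase letters are outside the alphabet. A lone "`" or
" " also passes, since it encodes a zero-byte line, so the caller would
still have to treat that as the end of the data.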
Here's modified lexer code that deals with the above comments (but
doesn't ensure that uuencoded text is really present):
<TEXT>^begin\ 666\ [^ ]*$ { BEGIN UUENCODED; }
<UUENCODED>^end$ { BEGIN TEXT; }
<UUENCODED>\n$ { BEGIN TEXT; }
<UUENCODED>{TOKEN} /* ignore tokens */
<UUENCODED>\${NUM}(\.{NUM})? /* ignore money */
With the above code (and the message below), the command "bogolexer -p <
msg.begin.666.txt" finds 27 tokens, which seems about right.
### begin msg.begin.666.txt ###
Message-ID: <006a01c401f8$1d7d88e0$6b02a8c0 at blabla.ru>
From: "AAA" <aaa at blabla.ru>
To: <bbb at blabla.msk.su>
Subject: some text
Date: Thu, 4 Mar 2004 17:51:02 +0300
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 6.00.2600.0000
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2600.0000
some text in koi8-r coding
begin 666 LK2540-7R.pdf
M)5!$1BTQ+C0-)>+CS],-"C$@,"!O8FH-/#P@#2]4>7!E("]086=E( TO4&%R
M96YT(#$U(# @4B -+U)E<V]U<F-E<R R(# @4B -+T-O;G1E;G1S(#,@,"!2
M( TO365D:6%";W@@6R P(# @-#<S(#8V.2!=( TO0W)O<$)O>"!;(# @," T
end
this is a test
### end msg.begin.666.txt ###