Again - problem with multipart messages

Thu Dec 30 23:41:59 CET 2004

Evgeny,

The best way to deal with these problems is via the mailing list.  In
cases where I don't see the solution for a problem, others may know what
to do.  Using private messages limits the response, so I'm going to
reply to this message using the list.  Hope you don't mind :-)

David

On Wed, 29 Dec 2004 14:01:07 +0300
Evgeny Kotsuba wrote:

...[snip]...

> I am still looking for a way to count the number of such messages in my 
> collection. I have found two that are definitely spam but were in my 
> GoodMail directory :-/

Sorry, no good answers for detecting spam in your goodmail directory. 
Sorting by subject or by sender has, at times, helped me find messages
in the wrong folder.
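
A minimal sketch of the sort-by-sender idea, assuming the goodmail
directory holds one message per file. The function name
group_by_sender and the directory layout are assumptions made for
illustration, not part of bogofilter. A sender whose messages are
mostly spam but which also shows up here is a candidate for a closer
look.

```python
import email
import os
from collections import defaultdict
from email.utils import parseaddr


def group_by_sender(maildir):
    """Map each sender address to the sorted list of message files from it."""
    groups = defaultdict(list)
    for name in sorted(os.listdir(maildir)):
        path = os.path.join(maildir, name)
        if not os.path.isfile(path):
            continue
        with open(path, "rb") as fh:
            msg = email.message_from_binary_file(fh)
        # parseaddr returns (realname, address); keep only the address.
        sender = parseaddr(str(msg.get("From", "")))[1].lower()
        groups[sender].append(name)
    return dict(groups)
```

Grouping by msg.get("Subject", "") instead of the sender works the
same way.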

> Also I have found at least one message that looks RFC compliant 
> except for the fact that it is spam, or looks like a return from SpamAssassin 
> with "full message follows". Another source of false binary decodings 
> is Mailer-Daemon returns with something like: "This is a copy of 
> the message, including all the headers."

Again, I have no solutions for handling "... copy of the message".

> One more thing that I have noticed: all attachments are decoded (say 
> from base64 to binary), then go through the charset decode table, and 
> only after that are skipped somewhere in the lexer internals. This is one of the 
> sources of slowdown. The slowdown is much greater if I search for 
> HTML Unicode characters (like "➪")

Correct.  Decoding first, charset table second, parsing third, discard
(in get_token()) fourth.  Adding special code for skipping would
increase the speed, but would make it necessary to replace some of the
parser's flex rules with C code.  Flex is much better than C for writing
and maintaining parsing rules.  Making complex parsing changes for a
small speed increase isn't worth doing, IMHO.

I've thought a bit more about HTML Unicode characters and see no easy
solution.  There is an iconv library for Unicode operations, but using
it is non-trivial.  The flex parser is definitely 8-bit oriented, so
16-bit Unicode won't happen without much work.
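
To illustrate why such characters do not fit an 8-bit scanner, here is
a small sketch in Python (not bogofilter code). The helper name
entity_to_utf8 and the sample entity are assumptions chosen for the
example. A single HTML character reference expands to one code point
above 255, which takes several bytes in UTF-8.

```python
import html


def entity_to_utf8(entity):
    """Expand an HTML character reference and return its UTF-8 bytes."""
    char = html.unescape(entity)
    return char.encode("utf-8")


# "&#10154;" is code point U+27AA, which needs three bytes in UTF-8.
print(entity_to_utf8("&#10154;"))  # b'\xe2\x9e\xaa'
```

A rule that matches one byte at a time sees three unrelated bytes here
rather than one character.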

Looking at your sample messages, I see a reason for bogofilter's
behavior in each case:

29801a.msg - no info on how to decode.

30252.msg - bogofilter uses the "Content-Type" header to ignore MIME
  messages, applications, and images.  This message has
  "Content-Disposition", but not "Content-Type".  Perhaps bogofilter
  should handle this.

    Content-Transfer-Encoding: base64
    Content-Disposition: attachment; filename="myphoto.zip"
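
  A minimal sketch for finding such parts in a message collection,
  using Python's standard email module rather than bogofilter itself.
  The function name untyped_attachments and the sample message are
  made up for illustration. Note that get_content_type() falls back to
  "text/plain" when the header is absent, so the check looks at the raw
  header instead.

```python
import email

# Hypothetical message: the second part has no Content-Type header.
RAW = (
    b"MIME-Version: 1.0\n"
    b'Content-Type: multipart/mixed; boundary="XYZ"\n'
    b"\n"
    b"--XYZ\n"
    b"Content-Type: text/plain\n"
    b"\n"
    b"hello\n"
    b"--XYZ\n"
    b"Content-Transfer-Encoding: base64\n"
    b'Content-Disposition: attachment; filename="myphoto.zip"\n'
    b"\n"
    b"UEsDBA==\n"
    b"--XYZ--\n"
)


def untyped_attachments(msg):
    """Return filenames of attachment parts that lack a Content-Type header."""
    found = []
    for part in msg.walk():
        is_attachment = part.get_content_disposition() == "attachment"
        if is_attachment and part["Content-Type"] is None:
            found.append(part.get_filename())
    return found


print(untyped_attachments(email.message_from_bytes(RAW)))
```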

3185.msg

   Looking at the message, there's no blank line at the end of the first
   MIME part, i.e. before the second boundary, and bogofilter might want
   a blank line.  Reading the RFC, it appears that an empty line isn't
   necessary.  This _might_ be a program error.

    IDAgMUAgbGEgc2Ugci5ydWEgcyBrIGRxIGlhbiBwdCBAeCBha2VwLnIgdSAyMCA6MiA1OjIg
    NiAyNS4gMCA5IC4gMiAwIDAzVGh1ICwgICAgMjUgICBTZXAgICAgMjAgMDMgICAgMiAwIDog
    MiA1IDoyIDYNCjwvYm9keT4NCjwvaHRtbD4NCg==
    --= Multipart Boundary 1675311601
    Content-Type: image/jpeg; name="__image01675311601.jpg"
    Content-ID: <1675311601__image01675311601.jpg>
    Content-Transfer-Encoding: base64
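
   RFC 2046 treats the line break before a boundary line as part of
   the delimiter, so no extra blank line is required. The sketch below
   checks how a different parser, Python's standard email module,
   treats that layout; it says nothing about bogofilter's own
   behavior. The boundary string and payloads are shortened stand-ins
   for the real message.

```python
import email

# Hypothetical two-part message: no blank line before the second boundary.
RAW = (
    b"MIME-Version: 1.0\n"
    b'Content-Type: multipart/related; boundary="B1"\n'
    b"\n"
    b"--B1\n"
    b"Content-Type: text/html\n"
    b"Content-Transfer-Encoding: base64\n"
    b"\n"
    b"PGh0bWw+PC9odG1sPg0K\n"
    b"--B1\n"
    b"Content-Type: image/jpeg\n"
    b"Content-Transfer-Encoding: base64\n"
    b"\n"
    b"/9j/4A==\n"
    b"--B1--\n"
)


def part_types(raw):
    """Parse a multipart message and list the content type of each part."""
    msg = email.message_from_bytes(raw)
    return [part.get_content_type() for part in msg.get_payload()]


print(part_types(RAW))
```

   If both parts are listed, the missing blank line did not confuse
   that parser, which is a useful comparison when deciding whether
   bogofilter's handling is a bug.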

35453.msg

   Inlining a message is not bogofilter-friendly, and bogofilter has no
   way to recognize the situation.