Again - problem with multipart messages
David Relson
relson at osagesoftware.com
Thu Dec 30 23:41:59 CET 2004
Evgeny,
The best way to deal with these problems is via the mailing list. In
cases where I don't see the solution for a problem, others may know what
to do. Using private messages limits the response, so I'm going to
reply to this message using the list. Hope you don't mind :-)
David
On Wed, 29 Dec 2004 14:01:07 +0300
Evgeny Kotsuba wrote:
...[snip]...
> I am still searching for a way to count the number of such messages in
> my collection. I have found two that are definitely spam but were in my
> GoodMail directory :-/
Sorry, no good answers for detecting spam in your goodmail directory.
Sorting by subject or by sender has, at times, helped me find messages
in the wrong folder.
> Also I have found at least one message that looks RFC compliant
> except for the fact that it is spam, or looks like a return from
> SpamAssassin with "full message follows". Another source of false
> binary decodings is Mailer-Daemon returns with something like: "This
> is a copy of the message, including all the headers."
Again, I have no solutions for handling "... copy of the message".
> One more thing that I have noticed - all attachments are decoded (say
> from base64 to binary), then go through the charset decode table, and
> only after that are skipped somewhere in the lexer internals. This is
> one of the sources of slowdown. The slowdown is much greater if I
> search for HTML unicode characters (like "➪" )
Correct. Decoding first, charset table second, parsing third, discard
(in get_token()) fourth. Adding special code for skipping would
increase the speed, but would make it necessary to replace some of the
parser's flex rules with C code. Flex is much better than C for writing
and maintaining parsing rules. Making complex parsing changes for a
small speed increase isn't worth doing, IMHO.
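
To illustrate the ordering issue only, here is a minimal Python sketch (not bogofilter's flex/C code) in which the declared content type is checked before any base64 or charset work is done. The function name and the set of skipped types are assumptions made up for this example.

```python
from email import message_from_string

# Hypothetical set of major types whose bodies are never tokenized.
SKIPPED_MAJOR_TYPES = {"image", "application", "audio", "video"}

def tokenizable_text(raw_message):
    """Return decoded text of the parts worth tokenizing.

    Parts with a skipped major type are dropped before decoding,
    instead of being decoded, charset-mapped, parsed and then discarded.
    """
    msg = message_from_string(raw_message)
    chunks = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no body of their own
        if part.get_content_maintype() in SKIPPED_MAJOR_TYPES:
            continue  # skipped without any base64 or charset work
        payload = part.get_payload(decode=True) or b""
        charset = part.get_content_charset() or "latin-1"
        try:
            chunks.append(payload.decode(charset, errors="replace"))
        except LookupError:
            # Unknown charset label: fall back to a byte-preserving decode.
            chunks.append(payload.decode("latin-1"))
    return "\n".join(chunks)
```

The sketch relies on the part declaring its type correctly, so it shares the weakness discussed for the sample messages below.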
I've thought a bit more about HTML Unicode characters and see no easy
solution. There is an iconv library for Unicode operations, but using
it is non-trivial. The flex parser is definitely 8-bit oriented, so
16-bit Unicode won't happen without much work.
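
A small Python sketch of why such characters don't fit an 8-bit parser: once an HTML entity is resolved, the resulting code point may be larger than 255. The helper name is hypothetical, and the sketch uses Python's standard `html` module, not iconv.

```python
import html

def fits_in_8_bits(text):
    """True if the text resolves to a single character with code point < 256."""
    ch = html.unescape(text)  # resolves entities such as "&eacute;"
    return len(ch) == 1 and ord(ch) < 256

# "&eacute;" resolves to a Latin-1 character, so one byte is enough.
latin1_ok = fits_in_8_bits("&eacute;")

# The arrow quoted above lies outside the 8-bit range.
arrow_ok = fits_in_8_bits("➪")
```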
Looking at your sample messages, I see a reason for bogofilter's
behavior in each case:
29801a.msg - no info on how to decode.
30252.msg - bogofilter uses the "Content-Type" directive to ignore mime
messages, applications, and images. This message has
"Content-Disposition", but not "Content-Type". Perhaps bogofilter
should handle this.
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename="myphoto.zip"
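
One way such a fallback could look, as a Python sketch only: when "Content-Type" is missing, consult "Content-Disposition" and the filename extension. The extension list and the function name are assumptions for illustration and are not taken from bogofilter.

```python
import os
from email import message_from_string

# Hypothetical list of extensions treated as binary.
BINARY_EXTENSIONS = {".zip", ".exe", ".jpg", ".jpeg", ".gif", ".png", ".pdf"}

def looks_like_binary_attachment(part):
    """Guess whether a MIME part is a binary attachment that can be ignored."""
    if "Content-Type" in part:
        # Normal case: trust the declared type.
        return part.get_content_maintype() in {"image", "application"}
    # Fallback: no Content-Type, so look at Content-Disposition instead.
    if part.get_content_disposition() != "attachment":
        return False
    filename = part.get_filename() or ""
    return os.path.splitext(filename)[1].lower() in BINARY_EXTENSIONS
```

With the two headers shown above, the part has no declared type, is marked as an attachment, and its filename ends in ".zip", so the sketch would treat it as binary.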
3185.msg
Looking at the message, there's no blank line at the end of the first
mime part, i.e. before second boundary, and bogofilter might want a
blank line. Reading the RFC it appears that an empty line isn't
necessary. This _might_ be a program error.
IDAgMUAgbGEgc2Ugci5ydWEgcyBrIGRxIGlhbiBwdCBAeCBha2VwLnIgdSAyMCA6MiA1OjIg
NiAyNS4gMCA5IC4gMiAwIDAzVGh1ICwgICAgMjUgICBTZXAgICAgMjAgMDMgICAgMiAwIDog
MiA1IDoyIDYNCjwvYm9keT4NCjwvaHRtbD4NCg==
--= Multipart Boundary 1675311601
Content-Type: image/jpeg; name="__image01675311601.jpg"
Content-ID: <1675311601__image01675311601.jpg>
Content-Transfer-Encoding: base64
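
For reference, RFC 2046 (section 5.1.1) treats the line break immediately before a boundary line as part of the boundary itself, so a part may end directly before the next boundary. The Python sketch below builds a made-up message with the same shape (base64 part, no blank line, next boundary) and parses it with the standard `email` package. It shows how one other parser reads such input and says nothing about bogofilter's own behavior.

```python
import base64
from email import message_from_string

# Made-up first part; the encoded text is followed directly by the boundary.
HTML_BODY = b"<html></html>\r\n"
ENCODED = base64.b64encode(HTML_BODY).decode("ascii")

RAW = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/related; boundary="BOUNDARY"\n'
    "\n"
    "--BOUNDARY\n"
    "Content-Type: text/html\n"
    "Content-Transfer-Encoding: base64\n"
    "\n"
    + ENCODED + "\n"      # no blank line before the next boundary
    + "--BOUNDARY\n"
    "Content-Type: image/jpeg\n"
    "Content-Transfer-Encoding: base64\n"
    "\n"
    "/9j/4AAQ\n"
    "--BOUNDARY--\n"
)

def part_types(raw):
    """Return the content type of each top-level MIME part."""
    msg = message_from_string(raw)
    return [p.get_content_type() for p in msg.get_payload()]

types = part_types(RAW)
first_body = message_from_string(RAW).get_payload()[0].get_payload(decode=True)
```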
35453.msg
Inlining a message is not bogofilter friendly and bogofilter has no
way to recognize the situation.