... convert_unicode.c ...

Thu Jun 23 00:04:37 CEST 2005

On Tue, 21 Jun 2005, David Relson wrote:

> Unfortunately I don't have information that says which is better:
> 
> 1 - no translation
> 2 - iso-8859-1 to utf-8 translation

I think the latter (8859-1 (or perhaps windows-1252) to UTF-8) is
better because:

1. it generates correct tokens when the bogus charset name was
   supposed to be interpreted as our default charset

2. it does not generate invalid UTF-8 sequences
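
To illustrate point 2, here is a minimal sketch of an 8859-1 to UTF-8
translation. The function name latin1_to_utf8 and its buffer convention
are made up for illustration and are not taken from convert_unicode.c.
Every input byte maps to either one ASCII byte or a well-formed two-byte
sequence, so the output cannot contain invalid UTF-8.

```c
#include <stddef.h>

/* Hypothetical helper (not from the Bogofilter source): convert
 * ISO-8859-1 bytes to UTF-8. The caller must provide room for
 * 2 * srclen bytes in dst. Returns the number of bytes written. */
size_t latin1_to_utf8(const unsigned char *src, size_t srclen,
                      unsigned char *dst)
{
    size_t out = 0;
    size_t i;

    for (i = 0; i < srclen; i++) {
        unsigned char c = src[i];

        if (c < 0x80) {
            /* ASCII maps to itself */
            dst[out++] = c;
        } else {
            /* 0x80..0xFF become U+0080..U+00FF: a lead byte
             * (0xC2 or 0xC3) plus one continuation byte */
            dst[out++] = (unsigned char)(0xC0 | (c >> 6));
            dst[out++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    return out;
}
```

A windows-1252 variant would need a small lookup table for the range
0x80..0x9F, which this sketch does not include.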

> Anybody know if there's an RFC that applies?

RFC 2045:

5.2.  Content-Type Defaults

   Default RFC 822 messages without a MIME Content-Type header are taken
   by this protocol to be plain text in the US-ASCII character set,
   which can be explicitly specified as:

     Content-type: text/plain; charset=us-ascii

   This default is assumed if no Content-Type header field is specified.
   It is also recommend that this default be assumed when a
   syntactically invalid Content-Type header field is encountered. In
   the presence of a MIME-Version header field and the absence of any
   Content-Type header field, a receiving User Agent can also assume
   that plain US-ASCII text was the sender's intent.  Plain US-ASCII
   text may still be assumed in the absence of a MIME-Version or the
   presence of an syntactically invalid Content-Type header field, but
   the sender's intent might have been otherwise.

RFC 2046:

4.1.4.  Unrecognized Subtypes

   Unrecognized subtypes of "text" should be treated as subtype "plain"
   as long as the MIME implementation knows how to handle the charset.
   Unrecognized subtypes which also specify an unrecognized charset
   should be treated as "application/octet-stream".

Treating text in an unrecognized charset as application/octet-stream
would be a nice, straightforward solution, but it would make it very
easy to hide text from Bogofilter.

RFC 2049:

2.  MIME Conformance

   A mail user agent that is MIME-conformant MUST:

    (6)   Explicitly handle the following media type values, to
          at least the following extents:

          Text:

            -- Treat material in an unknown character set as if
            it were "application/octet-stream".

See above: the same objection applies here.

By the way...

I am not sure it is a good idea to feed raw untrusted input to
iconv_open(). It is somewhat dangerous to assume that every implementation
is robust enough to handle any piece of binary crap in its arguments
(e.g. Solaris appears to use the arguments to assemble filenames, without
any checking!). Moreover, certain values might have special magic
semantics (e.g. the //TRANSLIT and //IGNORE suffixes in GNU iconv; they
are recognized in target charsets only, but you get the point).

I think we should make sure the charset name we pass to iconv_open() is
found in a set of known good values. As a side benefit, the same table
could map nonstandard aliases like "cp1250" (not registered at IANA) to
their registered names like "windows-1250".
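
A minimal sketch of such a table follows. The names charset_entry,
known_charsets and lookup_charset are hypothetical, and the right-hand
names are only examples: which names a given iconv implementation
accepts would have to be checked per platform. Anything not in the
table is rejected, so suffixes like //IGNORE or path-like strings never
reach iconv_open().

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical table entry: a name as seen in mail, and the name we
 * are willing to hand to iconv_open() for it. */
struct charset_entry {
    const char *name;
    const char *iconv_name;
};

/* Example entries only; a real table would be much longer. */
static const struct charset_entry known_charsets[] = {
    { "us-ascii",     "US-ASCII"     },
    { "iso-8859-1",   "ISO-8859-1"   },
    { "utf-8",        "UTF-8"        },
    { "windows-1250", "WINDOWS-1250" },
    { "cp1250",       "WINDOWS-1250" },  /* nonstandard alias */
};

/* Case-insensitive comparison of two NUL-terminated strings. */
static int equal_nocase(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;  /* both strings must end together */
}

/* Return the vetted name for an untrusted charset name, or NULL if
 * the name is not in the table. */
const char *lookup_charset(const char *untrusted)
{
    size_t i;
    size_t count = sizeof(known_charsets) / sizeof(known_charsets[0]);

    for (i = 0; i < count; i++) {
        if (equal_nocase(untrusted, known_charsets[i].name))
            return known_charsets[i].iconv_name;
    }
    return NULL;
}
```

A caller would pass only the returned string to iconv_open() and fall
back to the default charset when the result is NULL.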

BTW: according to RFC 2045, the charset value may be followed by a
comment or enclosed in quotes. It says:

   Thus the following two forms

     Content-type: text/plain; charset=us-ascii (Plain text)

     Content-type: text/plain; charset="us-ascii"

   are completely equivalent.
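
As a rough sketch, both forms can be reduced to the bare name before
any table lookup. The function extract_charset is hypothetical, and it
is deliberately simplified: it expects the text right after "charset="
and does not handle comments placed before the value.

```c
#include <stddef.h>

/* Hypothetical helper: copy the charset value found at the start of
 * raw (the text following "charset=") into out, dropping surrounding
 * quotes and any trailing comment. Returns 0 on success, -1 if the
 * value is empty, malformed, or does not fit in out. */
int extract_charset(const char *raw, char *out, size_t outlen)
{
    const char *p = raw;
    size_t n = 0;

    if (outlen == 0)
        return -1;

    /* skip leading whitespace */
    while (*p == ' ' || *p == '\t')
        p++;

    if (*p == '"') {
        /* quoted-string: copy up to the closing quote */
        p++;
        while (*p != '\0' && *p != '"') {
            if (*p == '\\' && p[1] != '\0')
                p++;                      /* quoted-pair: take next char */
            if (n + 1 >= outlen)
                return -1;
            out[n++] = *p++;
        }
        if (*p != '"')
            return -1;                    /* unterminated quote */
    } else {
        /* bare token: stop at whitespace, ';' or the start of a comment */
        while (*p != '\0' && *p != ' ' && *p != '\t' &&
               *p != ';' && *p != '(') {
            if (n + 1 >= outlen)
                return -1;
            out[n++] = *p++;
        }
    }

    out[n] = '\0';
    return n > 0 ? 0 : -1;
}
```

With this, the two header forms quoted above both yield "us-ascii".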

--Pavel Kankovsky aka Peak  [ Boycott Microsoft--http://www.vcnet.com/bms ]
"Resistance is futile. Open your source code and prepare for assimilation."