... convert_unicode.c ...

Thu Jun 23 02:42:20 CEST 2005

On Thu, 23 Jun 2005 00:04:37 +0200 (CEST)
Pavel Kankovsky wrote:

> On Tue, 21 Jun 2005, David Relson wrote:
> 
> > Unfortunately I don't have information that says which is better:
> > 
> > 1 - no translation
> > 2 - iso-8859-1 to utf-8 translation
> 
> I think the latter (8859-1 (or perhaps windows-1252) to UTF-8) is
> better because:
> 
> 1. it generates correct tokens when the bogus charset name was
>    supposed to be interpreted as our default charset
> 
> 2. it does not generate invalid UTF-8 sequences

Using "iso-8859-1" as the default "from charset" seems reasonable to me.
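
As a rough sketch of that default (not taken from convert_unicode.c; the
table contents and the name charset_for_iconv() are hypothetical), an
unknown or missing charset name could be mapped to "iso-8859-1" before
anything is handed to iconv_open():

```c
#include <stddef.h>
#include <strings.h>   /* strcasecmp (POSIX) */

#define DEFAULT_CHARSET "iso-8859-1"

/* Hypothetical table: name seen in the message -> name passed to iconv_open().
 * A real table would list every charset the filter is willing to convert. */
static const struct {
    const char *seen;
    const char *canonical;
} known_charsets[] = {
    { "us-ascii",     "us-ascii"     },
    { "iso-8859-1",   "iso-8859-1"   },
    { "utf-8",        "utf-8"        },
    { "windows-1250", "windows-1250" },
    { "cp1250",       "windows-1250" },  /* alias not registered at IANA */
};

/* Return a known-good charset name for `name`, or the default if the
 * name is NULL or not in the table.  Comparison ignores case. */
const char *charset_for_iconv(const char *name)
{
    size_t i;

    if (name != NULL) {
        for (i = 0; i < sizeof known_charsets / sizeof known_charsets[0]; i++) {
            if (strcasecmp(name, known_charsets[i].seen) == 0)
                return known_charsets[i].canonical;
        }
    }
    return DEFAULT_CHARSET;
}
```

With a lookup like this, iconv_open("UTF-8", charset_for_iconv(name)) would
only ever see names from the table, whatever the message header contains.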

> > Anybody know if there's an RFC that applies?
> 
> RFC 2045:
> 
> 5.2.  Content-Type Defaults
> 
>    Default RFC 822 messages without a MIME Content-Type header are taken
>    by this protocol to be plain text in the US-ASCII character set,
>    which can be explicitly specified as:
> 
>      Content-type: text/plain; charset=us-ascii
> 
>    This default is assumed if no Content-Type header field is specified.
>    It is also recommend that this default be assumed when a
>    syntactically invalid Content-Type header field is encountered. In
>    the presence of a MIME-Version header field and the absence of any
>    Content-Type header field, a receiving User Agent can also assume
>    that plain US-ASCII text was the sender's intent.  Plain US-ASCII
>    text may still be assumed in the absence of a MIME-Version or the
>    presence of an syntactically invalid Content-Type header field, but
>    the sender's intent might have been otherwise.
> 
> 
> RFC 2046:
> 
> 4.1.4.  Unrecognized Subtypes
> 
>    Unrecognized subtypes of "text" should be treated as subtype "plain"
>    as long as the MIME implementation knows how to handle the charset.
>    Unrecognized subtypes which also specify an unrecognized charset
>    should be treated as "application/octet- stream".
> 
> This would be a nice, straightforward solution but it would make it very 
> easy to hide text from Bogofilter.
> 
> 
> RFC 2049:
> 
> 2.  MIME Conformance
> 
>    A mail user agent that is MIME-conformant MUST:
> 
>     (6)   Explicitly handle the following media type values, to
>           at least the following extents:
> 
>           Text:
> 
>             -- Treat material in an unknown character set as if
>             it were "application/octet-stream".
> 
> 
> See above.
> 
> 
> By the way...
> 
> I am not sure it is a good idea to feed raw untrusted input to
> iconv_open(). It is somewhat dangerous to assume all implementations are
> robust enough to handle any piece of binary crap in iconv_open()  
> arguments (e.g. Solaris appears to use the arguments to assemble
> filenames...without any checking!). Moreover, certain values might have
> special magic semantics (e.g. //TRANSLIT and //IGNORE suffixes in GNU
> iconv...well, they recognize them in target charsets only but you get the
> point).
> 
> I think we should make sure the charset name we pass to iconv_open() is
> found in the set of known good values. On the other hand, the table of
> known values might be used to implement nonstandard aliases like "cp1250"
> (not registered at IANA) instead of "windows-1250" (registered at IANA).
> 
> BTW: comments are allowed in charset names according to RFC 2045.
> It says:
> 
>    Thus the following two forms
> 
>      Content-type: text/plain; charset=us-ascii (Plain text)
> 
>      Content-type: text/plain; charset="us-ascii"
> 
>    are completely equivalent.

Bogofilter's parsing rules (in lexer_v3.l) have the following lines:

CHARSET	[[:alnum:]-]+
<INITIAL>charset=\"?{CHARSET}\"?  { ... }

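which handle the two forms correctly (even though I didn't know
about the comment form until your message): the match stops at the
first character that is not alphanumeric or '-', so the trailing
comment never becomes part of the charset name.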
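
To illustrate what that pattern accepts, here is a small stand-alone C
sketch of the same matching logic. It is not bogofilter code: the function
name extract_charset() is made up for this example, and unlike the lexer it
searches for a lowercase "charset=" only.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Copy the charset token that follows "charset=" in `param` into `out`.
 * Mirrors  charset=\"?[[:alnum:]-]+\"?  : skip an optional opening quote,
 * then take letters, digits and '-'.  Returns the token length, or 0
 * (with `out` set to "") if there is no token or it does not fit. */
size_t extract_charset(const char *param, char *out, size_t outsize)
{
    static const char key[] = "charset=";
    const char *p;
    size_t n = 0;

    if (outsize == 0)
        return 0;
    out[0] = '\0';

    p = strstr(param, key);
    if (p == NULL)
        return 0;
    p += sizeof key - 1;          /* skip "charset=" */
    if (*p == '"')
        p++;                      /* optional opening quote */

    while (isalnum((unsigned char)p[n]) || p[n] == '-')
        n++;                      /* stops at '"', ' ', '(' and so on */

    if (n == 0 || n >= outsize)
        return 0;
    memcpy(out, p, n);
    out[n] = '\0';
    return n;
}
```

Both "charset=us-ascii (Plain text)" and "charset=\"us-ascii\"" yield the
token "us-ascii" here, because the scan ends at the space or the closing
quote respectively.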