... convert_unicode.c ...
David Relson
relson at osagesoftware.com
Thu Jun 23 02:42:20 CEST 2005
On Thu, 23 Jun 2005 00:04:37 +0200 (CEST)
Pavel Kankovsky wrote:
> On Tue, 21 Jun 2005, David Relson wrote:
>
> > Unfortunately I don't have information that says which is better:
> >
> > 1 - no translation
> > 2 - iso-8859-1 to utf-8 translation
>
> I think the latter (8859-1 (or perhaps windows-1252) to UTF-8) is
> better because:
>
> 1. it generates correct tokens when the bogus charset name was
> supposed to be interpreted as our default charset
>
> 2. it does not generate invalid UTF-8 sequences
Using "iso-8859-1" as the default "from charset" seems reasonable to me.
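For illustration, here is a minimal sketch of why the 8859-1 fallback can never produce invalid UTF-8: every input byte maps to exactly one well-formed one- or two-byte sequence. The function name latin1_to_utf8 and its interface are hypothetical; this is not the code in convert_unicode.c, which goes through iconv.

```c
#include <stddef.h>

/* Hypothetical helper, not taken from bogofilter: convert ISO-8859-1
 * bytes to UTF-8.  Returns the number of bytes written to out, or
 * (size_t)-1 if out is too small.  No NUL terminator is appended. */
size_t latin1_to_utf8(const unsigned char *in, size_t inlen,
                      unsigned char *out, size_t outcap)
{
    size_t o = 0;

    for (size_t i = 0; i < inlen; i++) {
        unsigned char c = in[i];

        if (c < 0x80) {
            /* US-ASCII range: copied unchanged. */
            if (o + 1 > outcap)
                return (size_t)-1;
            out[o++] = c;
        } else {
            /* 0x80..0xFF: code point U+0080..U+00FF, always two bytes. */
            if (o + 2 > outcap)
                return (size_t)-1;
            out[o++] = (unsigned char)(0xC0 | (c >> 6));
            out[o++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    return o;
}
```

An output buffer of twice the input length is always large enough. For example, the four ISO-8859-1 bytes 63 61 66 E9 ("cafe" with an acute accent on the e) become the five UTF-8 bytes 63 61 66 C3 A9.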
> > Anybody know if there's an RFC that applies?
>
> RFC 2045:
>
> 5.2. Content-Type Defaults
>
> Default RFC 822 messages without a MIME Content-Type header are taken
> by this protocol to be plain text in the US-ASCII character set,
> which can be explicitly specified as:
>
> Content-type: text/plain; charset=us-ascii
>
> This default is assumed if no Content-Type header field is specified.
> It is also recommend that this default be assumed when a
> syntactically invalid Content-Type header field is encountered. In
> the presence of a MIME-Version header field and the absence of any
> Content-Type header field, a receiving User Agent can also assume
> that plain US-ASCII text was the sender's intent. Plain US-ASCII
> text may still be assumed in the absence of a MIME-Version or the
> presence of an syntactically invalid Content-Type header field, but
> the sender's intent might have been otherwise.
>
>
> RFC 2046:
>
> 4.1.4. Unrecognized Subtypes
>
> Unrecognized subtypes of "text" should be treated as subtype "plain"
> as long as the MIME implementation knows how to handle the charset.
> Unrecognized subtypes which also specify an unrecognized charset
> should be treated as "application/octet-stream".
>
> This would be a nice, straightforward solution but it would make it very
> easy to hide text from Bogofilter.
>
>
> RFC 2049:
>
> 2. MIME Conformance
>
> A mail user agent that is MIME-conformant MUST:
>
> (6) Explicitly handle the following media type values, to
> at least the following extents:
>
> Text:
>
> -- Treat material in an unknown character set as if
> it were "application/octet-stream".
>
>
> See above.
>
>
> By the way...
>
> I am not sure it is a good idea to feed raw untrusted input to
> iconv_open(). It is somewhat dangerous to assume all implementations are
> robust enough to handle any piece of binary crap in iconv_open()
> arguments (e.g. Solaris appears to use the arguments to assemble
> filenames...without any checking!). Moreover, certain values might have
> special magic semantics (e.g. //TRANSLIT and //IGNORE suffixes in GNU
> iconv...well, they recognize them in target charsets only but you get the
> point).
>
> I think we should make sure the charset name we pass to iconv_open() is
> found in the set of known good values. On the other hand, the table of
> known values might be used to implement nonstandard aliases like "cp1250"
> (not registered at IANA) instead of "windows-1250" (registered at IANA).
>
> BTW: comments are allowed in charset names according to RFC 2045.
> It says:
>
> Thus the following two forms
>
> Content-type: text/plain; charset=us-ascii (Plain text)
>
> Content-type: text/plain; charset="us-ascii"
>
> are completely equivalent.
Bogofilter's parsing rules (in lexer_v3.l) have the following lines:
  CHARSET [[:alnum:]-]+
  <INITIAL>charset=\"?{CHARSET}\"? { ... }
which process the two forms correctly (even though I didn't know
about the comment form until your message).
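As for the known-good table Pavel suggests, a rough sketch might look like the following. The table contents, the canonical spellings, and the name lookup_charset are all assumptions made for illustration; which names a given iconv_open() actually accepts varies by implementation and would need checking.

```c
#include <stddef.h>

/* Hypothetical table: charset name as seen in mail, and the name we
 * would hand to iconv_open().  "cp1250" shows a nonstandard alias. */
struct charset_alias {
    const char *name;
    const char *canonical;
};

static const struct charset_alias known_charsets[] = {
    { "us-ascii",     "US-ASCII"     },
    { "iso-8859-1",   "ISO-8859-1"   },
    { "utf-8",        "UTF-8"        },
    { "windows-1250", "WINDOWS-1250" },
    { "cp1250",       "WINDOWS-1250" },
    { "windows-1252", "WINDOWS-1252" },
};

/* Locale-independent lowercase for ASCII letters only. */
static int ascii_lower(int c)
{
    return (c >= 'A' && c <= 'Z') ? c + ('a' - 'A') : c;
}

/* Case-insensitive comparison of two NUL-terminated names. */
static int names_equal(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (ascii_lower((unsigned char)*a) != ascii_lower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == '\0' && *b == '\0';
}

/* Returns the canonical name for a known charset, or NULL when the name
 * is not in the table (the caller then falls back to its default). */
const char *lookup_charset(const char *name)
{
    size_t count = sizeof known_charsets / sizeof known_charsets[0];

    for (size_t i = 0; i < count; i++) {
        if (names_equal(name, known_charsets[i].name))
            return known_charsets[i].canonical;
    }
    return NULL;
}
```

Because only exact table entries match, strings such as "utf-8//TRANSLIT" or anything resembling a path come back as NULL and never reach iconv_open(); the caller would then use the default "from charset".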