
Re: [idn] URL encoding in html page



Dan asked:

> Kenneth Whistler writes:
> > The *real* problem is guaranteeing interoperability for UTF-8, UTF-16,
> > and UTF-32, which are the three sanctioned encoding forms of Unicode
> 
> The obvious choice for Internet protocols is UTF-8. See RFC 2277.

No argument there.

> Systems that use 16-bit encodings internally, such as Windows, handle
> UTF-8 conversions at the boundary between the system and the network.
> 
> What's the problem?

What I am referring to is code point range interoperability among
the three encoding forms of Unicode. The Unicode Standard is tied to
ISO/IEC 10646, which is architecturally a 31-bit character encoding
standard, by which I mean that its nominal code space is
0..0x7FFFFFFF.
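
As a side note on where that 31-bit figure comes from: the original
UTF-8 design (RFC 2279) used one to six bytes per character, and the
six-byte form covers the full 31-bit code space. A minimal sketch in
Python 3 (my illustration, not anything from the standards text):

    # Payload bits per sequence length in the original 1..6-byte
    # UTF-8 design (RFC 2279); six bytes reach all of 0..0x7FFFFFFF.
    for n_bytes, bits in [(1, 7), (2, 11), (3, 16), (4, 21), (5, 26), (6, 31)]:
        print(f"{n_bytes} byte(s) -> up to U+{2**bits - 1:X}")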

The Unicode Standard formally limits that code space, however, to
a 21-bit range, namely 0..0x10FFFF. The reason for that is that
UTF-16 can only address that range: beyond the BMP, surrogate pairs
reach exactly 0x400 * 0x400 = 0x100000 supplementary code points,
topping out at U+10FFFF. In the Unicode Standard, then, UTF-32 is
also constrained to 0..0x10FFFF, and UTF-8 is constrained to
four-byte forms up to <F4 8F BF BF> (i.e. U+10FFFF). *That* is the
guarantee of interoperability, since it means that any valid value
in UTF-8 can be accurately converted to either of the two other
forms, and vice versa.
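
To make that concrete, here is a small Python 3 sketch (my own
illustration, not part of either standard) showing the ceiling value
in each encoding form and the lossless round-trip among them:

    # U+10FFFF, the highest Unicode code point, in each encoding form.
    ch = chr(0x10FFFF)
    utf8  = ch.encode('utf-8')      # b'\xf4\x8f\xbf\xbf' = <F4 8F BF BF>
    utf16 = ch.encode('utf-16-be')  # b'\xdb\xff\xdf\xff' = surrogate pair
    utf32 = ch.encode('utf-32-be')  # b'\x00\x10\xff\xff'

    # Any valid value converts accurately among the three forms.
    assert utf8.decode('utf-8') == utf16.decode('utf-16-be') \
                                == utf32.decode('utf-32-be')

    # chr(0x110000) raises ValueError: the 21-bit limit is enforced.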

*If* 10646 were ever to encode a character at a code point beyond
0x10FFFF, *then* there would be an interoperability problem. And
that is why 10646 has been amended recently to retrofit the same
constraints on allowable ranges for encoding as specified in the
Unicode Standard. That is everyone's guarantee that neither SC2/WG2
nor the Unicode Consortium is going to encode a character that
breaks encoding form interoperability, no matter which of the three
forms (or combinations thereof) you are using for an implementation.
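
For instance (again a Python 3 sketch of my own), a conforming UTF-8
decoder must reject a four-byte sequence just past that ceiling:

    # <F4 90 80 80> would be U+110000, outside the Unicode code
    # space, so a conforming UTF-8 decoder rejects it.
    try:
        bytes([0xF4, 0x90, 0x80, 0x80]).decode('utf-8')
    except UnicodeDecodeError as exc:
        print('rejected:', exc.reason)  # e.g. 'invalid continuation byte'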

The reason I brought this up at all was to head off the zany
garden-path discussions about "UTF-128" and extending UTF-8 on the
theory that there might not be enough code points at some
unspecified time in the future. Let the character encoding
committees deal with that issue. In the meantime, the IETF (and
this IDN WG) have the Unicode Standard and ISO/IEC 10646, with
their standard encoding forms -- just use them, with the guarantee
of interoperability that the relevant committees are providing,
and don't hare off into discussions about their supposed inadequacy
or limitations.

> Converting between UTF-8 and UTF-16 and UTF-32 doesn't cause IDNA-style
> interoperability failures.

Correct. Because the UTC has limited the code space range to ensure
that interoperability.
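
Anyone who wants to convince themselves can run this exhaustive
Python 3 check (a sketch of mine) that round-trips every Unicode
scalar value through all three forms:

    # Round-trip every scalar value through UTF-8/16/32. Surrogates
    # are skipped: they are not encodable in any of the three forms.
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue
        s = chr(cp)
        assert s.encode('utf-8').decode('utf-8') == s
        assert s.encode('utf-16-le').decode('utf-16-le') == s
        assert s.encode('utf-32-le').decode('utf-32-le') == s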

> Of course, Windows still has all sorts of problems related to its old
> ``code pages,'' and there are many similar problems with old character
> encodings under UNIX. The use of more than one 8-bit ASCII extension
> provides ample opportunity for IDNA-style interoperability failures.
> UTF-8 is a way out of this mess.

Actually, Unicode is a way out of this mess. And then which
particular encoding form(s) you choose depends on the requirements
of the protocol or application you are designing and implementing.
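
The code page mess is easy to demonstrate, incidentally. In this
sketch (Python 3, with three legacy encodings picked arbitrarily
for illustration), one and the same byte means three different
things:

    # One byte, three interpretations under different code pages.
    b = bytes([0xE9])
    print(b.decode('latin-1'))  # 'é'
    print(b.decode('cp1251'))   # 'й' (Cyrillic)
    print(b.decode('koi8-r'))   # 'И' (Cyrillic)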

--Ken