Re: [idn] URL encoding in html page



Kenneth Whistler writes:
> The *real* problem is guaranteeing interoperability for UTF-8, UTF-16,
> and UTF-32, which are the three sanctioned encoding forms of Unicode

The obvious choice for Internet protocols is UTF-8; RFC 2277 requires
protocols to be able to use it.
Systems that use 16-bit encodings internally, such as Windows, handle
UTF-8 conversions at the boundary between the system and the network.

What's the problem?

Converting among UTF-8, UTF-16, and UTF-32 doesn't cause IDNA-style
interoperability failures. It's crystal clear which pieces of text are
8-bit and which are 16-bit. Nobody says ridiculous IDNA-type things like
``you should think about converting that to 8 bits if you think it might
be displayed, but definitely leave it as 16 bits if you think another
program will look at it.'' Each interface makes a clear size choice.
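
To illustrate the boundary conversion described above, here is a minimal
Python sketch. The function names from_network and to_network are
hypothetical, chosen only for this example, and little-endian UTF-16 stands
in for "the system's internal 16-bit encoding"; a real system would use its
own conversion routines. The point is only that each side of the interface
has one fixed encoding and the round trip loses nothing.

```python
# Sketch only: from_network/to_network are hypothetical names.
# The network side is always UTF-8; the internal side is always UTF-16.

def from_network(wire_bytes: bytes) -> bytes:
    """Convert UTF-8 bytes from the network to internal UTF-16 (little-endian)."""
    return wire_bytes.decode("utf-8").encode("utf-16-le")


def to_network(internal_bytes: bytes) -> bytes:
    """Convert internal UTF-16 (little-endian) back to UTF-8 for the network."""
    return internal_bytes.decode("utf-16-le").encode("utf-8")


if __name__ == "__main__":
    # Includes a character outside the 16-bit range (a surrogate pair in UTF-16).
    text = "b\u00fccher.example \U0001F600"
    wire = text.encode("utf-8")
    internal = from_network(wire)
    # The round trip is lossless: the same bytes go back out on the wire.
    assert to_network(internal) == wire
    # UTF-32 behaves the same way.
    assert text.encode("utf-32-le").decode("utf-32-le") == text
```

Neither function has to guess what its input is: the encoding is a property
of the interface, not of the particular string.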

Of course, Windows still has all sorts of problems related to its old
``code pages,'' and there are many similar problems with old character
encodings under UNIX. The use of more than one 8-bit ASCII extension
provides ample opportunity for IDNA-style interoperability failures.
UTF-8 is a way out of this mess.
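
As a rough illustration of the code-page problem, the Python sketch below
decodes one and the same byte under two 8-bit ASCII extensions. The choice of
cp1252 and cp437 and the helper name decode_candidates are assumptions made
for the example; any pair of incompatible code pages would show the same
thing.

```python
# Sketch only: decode_candidates is a hypothetical helper.

def decode_candidates(raw: bytes, encodings=("cp1252", "cp437")) -> dict:
    """Decode the same bytes under several 8-bit code pages."""
    return {name: raw.decode(name) for name in encodings}


def is_valid_utf8(raw: bytes) -> bool:
    """Report whether the bytes form well-formed UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    raw = b"\xe9"
    results = decode_candidates(raw)
    # The same byte means different characters under different code pages,
    # and nothing in the byte itself says which reading is right.
    assert results["cp1252"] != results["cp437"]
    # A stray code-page byte is not even well-formed UTF-8, so it is
    # rejected rather than silently misread.
    assert not is_valid_utf8(raw)
    # The UTF-8 encoding of U+00E9 is unambiguous.
    assert "\u00e9".encode("utf-8") == b"\xc3\xa9"
```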

---D. J. Bernstein, Associate Professor, Department of Mathematics,
Statistics, and Computer Science, University of Illinois at Chicago