
[idn] surrogates in draft-ietf-idn-nameprep



Section 3.7.2 says:

> So far, all proposals for binary encodings of internationalized name
> parts have specified UTF-8 as the encoding format. In such an encoding,
> surrogate characters MUST NOT be used. Therefore, for UTF-8 encodings,
> the following are prohibited:
>
> D800-DFFF   [SURROGATE CHARACTERS]

This is incorrect. A pair of surrogates corresponds to a single
character in the 31-bit ISO 10646 code space, and according to RFC 2044
any value up to 2**31 - 1 can be encoded in UTF-8. Simply transform the
surrogate pair (UTF-16) into its UCS-4 value and then encode that value
in UTF-8.
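
As a rough illustration of that two-step transformation, here is a
minimal sketch in Python. The function names are made up for this
example, and the encoder follows the 1-to-6-byte layout described in
RFC 2044; it is not taken from the nameprep draft or from any
particular implementation.

```python
def surrogate_pair_to_ucs4(high, low):
    """Combine a high and a low surrogate into one UCS-4 value."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate: %#06x" % high)
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate: %#06x" % low)
    # Each surrogate carries 10 bits; the pair is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def ucs4_to_utf8(cp):
    """Encode a UCS-4 value (0 .. 2**31 - 1) in the RFC 2044 layout."""
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("outside the 31-bit code space")
    if cp < 0x80:
        return bytes([cp])
    # (exclusive upper limit, lead-byte marker, continuation byte count)
    layouts = (
        (0x800, 0xC0, 1),
        (0x10000, 0xE0, 2),
        (0x200000, 0xF0, 3),
        (0x4000000, 0xF8, 4),
        (0x80000000, 0xFC, 5),
    )
    for limit, lead, trailing in layouts:
        if cp < limit:
            out = [lead | (cp >> (6 * trailing))]
            # Emit the remaining bits six at a time, high bits first.
            for shift in range(6 * (trailing - 1), -1, -6):
                out.append(0x80 | ((cp >> shift) & 0x3F))
            return bytes(out)


# The first surrogate pair, D800 DC00, maps to U+10000.
cp = surrogate_pair_to_ucs4(0xD800, 0xDC00)
print(hex(cp), ucs4_to_utf8(cp).hex())
```

The pair is never encoded as two separate three-byte sequences; it is
combined first, and the resulting single value gets one four-byte
UTF-8 sequence.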

What might have been meant is that some current implementations of
UTF-8 mishandle surrogates. In practice, the most likely near-term
use for surrogate pairs is user-defined ideographs (e.g. obscure
Chinese and Japanese personal names), so it is reasonable to disallow
them, just not for the stated reason. Said another way: all ISO 10646
characters in the range representable by pairs of surrogates are
currently undefined (except for private use characters), and the
document elsewhere prohibits undefined characters, so we don't need
this section at all.