
Re: [idn] surrogates in draft-ietf-idn-nameprep



At 6:38 AM +1000 8/16/00, Frank Ernens wrote:
>Section 3.7.2 says
>
>>  So far, all proposals for binary encodings of internationalized name
>>  parts have specified UTF-8 as the encoding format. In such an encoding,
>>  surrogate characters MUST NOT be used. Therefore, for UTF-8 encodings,
>>  the following are prohibited:
>>
>>  D800-DFFF   [SURROGATE CHARACTERS]
>
>This is incorrect. A pair of surrogates corresponds to a character in
>the 31-bit ISO 10646 code space, and according to RFC2044 anything
>up to 2**31 - 1 can be encoded in UTF-8. Simply transform the
>UCS-2 to UCS-4 and then into UTF-8.

You may have misunderstood the draft: it deals with character code
points, and no encoding is assumed for the input. Surrogate code
points only make sense when using the UTF-16 encoding.
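
To make the distinction concrete, here is a minimal Python sketch. It is
not taken from the draft, and the function names are hypothetical. It
assumes the prohibition is applied to a sequence of code points: a
UTF-16 surrogate pair decodes to one code point at or above 10000
(hex), which is outside D800-DFFF, while a lone surrogate code point is
caught by the check.

```python
# Hypothetical helpers for illustration; not part of the nameprep draft.
SURROGATE_FIRST = 0xD800
SURROGATE_LAST = 0xDFFF


def is_surrogate(cp: int) -> bool:
    """Return True if the code point lies in the range D800-DFFF."""
    return SURROGATE_FIRST <= cp <= SURROGATE_LAST


def combine_surrogates(high: int, low: int) -> int:
    """Decode one UTF-16 surrogate pair into the code point it represents."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate: %04X" % high)
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate: %04X" % low)
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def contains_prohibited_surrogate(code_points) -> bool:
    """Check a sequence of code points (no encoding assumed) for D800-DFFF."""
    return any(is_surrogate(cp) for cp in code_points)
```

With these definitions, combine_surrogates(0xD800, 0xDC00) yields
0x10000, which contains_prohibited_surrogate() accepts, whereas the raw
values 0xD800 and 0xDC00 passed as code points are rejected.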

>What might have been meant was that some current implementations of
>UTF-8 mishandle surrogates. Actually, the most likely near-term
>use for them is in user-defined ideographs (e.g. obscure Chinese
>and Japanese personal names) and therefore it is reasonable
>to disallow them - just not for the stated reason. Said another
>way, since all ISO 10646 characters in the range representable
>by pairs of surrogates are currently undefined (except for private
>use characters), and the document elsewhere prohibits undefined
>characters, we don't need this section at all.

I fully disagree. By the time IDN is finished, 10646 will contain
characters outside plane 0. These will include more than "obscure" Han
characters. IDN should be able to handle these just as well as any
other characters.
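
As a rough illustration of what handling such characters involves, the
Python sketch below computes the plane of a code point and shows how one
code point outside plane 0 looks in UTF-8 and in UTF-16. The function
names are hypothetical, and the sketch assumes code points no higher
than 10FFFF (hex), the range reachable through surrogate pairs.

```python
# Hypothetical helpers for illustration; limited to code points <= 10FFFF.
def plane_of(cp: int) -> int:
    """Return the plane number of a code point (plane 0 is 0000-FFFF)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point out of range: %X" % cp)
    return cp >> 16


def utf8_bytes(cp: int) -> bytes:
    """Encode a single non-surrogate code point as UTF-8."""
    return chr(cp).encode("utf-8")


def utf16_units(cp: int) -> list:
    """Return the 16-bit code units UTF-16 uses for a single code point."""
    raw = chr(cp).encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]
```

For the code point 10000 (hex), plane_of() gives 1, utf8_bytes() gives
the four bytes F0 90 80 80, and utf16_units() gives the surrogate pair
D800 DC00. The surrogates appear only in the UTF-16 form.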

--Paul Hoffman, Director
--Internet Mail Consortium