
Re: Thread on - Re: [idn] Prohibit CDN code points



L.M.Tseng asked:

> Dear All:
> The draft
> http://www.ietf.org/internet-drafts/draft-ietf-idn-idna-06.txt
> defines a single Unicode code point as follows:
> 
> Unicode [UNICODE] is a coded character set containing tens of thousands
> of characters. A single Unicode code point is denoted by "U+" followed
> by four to six hexadecimal digits, while a range of Unicode code points
> is denoted by two hexadecimal numbers separated by "..", with no
> prefixes.

Patrick, Paul, or Adam may offer further clarification, but this is
basically a Unicode nomenclatural issue.

The string "U+006A" denotes the Unicode code point 0x6A (code points
range over 0..10FFFF), and by extension the character encoded at
that code point, namely LATIN SMALL LETTER J.

The case of the "U" and of the hexadecimal digits doesn't matter,
although the Unicode Standard most often uses uppercase. So some
people would also write "U+006a" or "u+006a" for the same Unicode
code point.

> 
> My questions are:
> Q1:   Can U+hhhh also be written as u+hhhh or not?

Yes. You can also leave off the U+ altogether where it is clear
that you are referring to Unicode characters, i.e. just "hhhh". So
for LATIN SMALL LETTER J, you could write just "006A" or "006a".
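
To make the notation point concrete, here is a minimal Python sketch
of a parser that treats all of these spellings as the same code
point. The name parse_code_point is made up for this illustration,
and accepting the bare "hhhh" form is a choice of the sketch, not
something any of the drafts define.

```python
import re
import unicodedata

# Optional "U+" prefix (either case), then four to six hex digits.
_CODE_POINT = re.compile(r"(?:U\+)?([0-9A-F]{4,6})", re.IGNORECASE)

def parse_code_point(text: str) -> int:
    """Return the integer code point denoted by U+hhhh, u+hhhh or hhhh."""
    match = _CODE_POINT.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a code point notation: {text!r}")
    value = int(match.group(1), 16)
    if value > 0x10FFFF:
        raise ValueError(f"outside the range 0..10FFFF: {text!r}")
    return value

# All four spellings denote the same code point, 0x6A.
spellings = ["U+006A", "U+006a", "u+006a", "006A"]
values = [parse_code_point(s) for s in spellings]
name = unicodedata.name(chr(values[0]))  # LATIN SMALL LETTER J
```

The notation is only a way of writing the number down; once it has
been parsed, its case is gone.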

> Q2:   Here U+HHHH is not a hostname. MUST it be forced to lowercase
> u+hhhh in nameprep or not?

I think you are mixing things up. If you put a Unicode character
into a hostname, you don't literally put the string "U+006A" (or
whatever) into the hostname. You put the encoded character itself,
in whatever Unicode encoding form you are using (UTF-8, UTF-16, ...),
into the hostname. The case of the "U+" notation therefore never
comes into play.

Thus, if my hostname were "jam", in Unicode UTF-8 that would be
just 0x6A 0x61 0x6D, since UTF-8 encodes ASCII characters like "j"
as the same single byte values they have in ASCII.

Or suppose my hostname were the Chinese word for 'banana', just to
pick a random example. It consists of two characters (pinyin:
xiang1jiao1), whose Unicode values are U+9999 U+8549. As a Unicode
string, that would just be two 16-bit numbers, 0x9999 followed by
0x8549, if using Unicode UTF-16, or the following byte sequence if
using Unicode UTF-8: 0xE9 0xA6 0x99 0xE8 0x95 0x89.
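
The two hostnames can be checked with a short Python sketch using
only the built-in codecs. The helper hex_bytes is hypothetical and
only formats the output; "utf-16-be" is chosen here so that the
16-bit units appear in reading order without a byte order mark.

```python
def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated 0xHH values."""
    return " ".join(f"0x{b:02X}" for b in data)

jam = "jam"
banana = "\u9999\u8549"  # the two characters U+9999 U+8549

jam_utf8 = jam.encode("utf-8")           # identical to the ASCII bytes
banana_utf8 = banana.encode("utf-8")     # three bytes per character
banana_utf16 = banana.encode("utf-16-be")  # one 16-bit unit per character

print(hex_bytes(jam_utf8))      # 0x6A 0x61 0x6D
print(hex_bytes(banana_utf8))   # 0xE9 0xA6 0x99 0xE8 0x95 0x89
print(hex_bytes(banana_utf16))  # 0x99 0x99 0x85 0x49
```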

> Q3:  Does the Punycode draft accept U+hhhh or u+hhhh to let the final
> encoded ASCII character (the last character of the corresponding
> encoded code point) be upper or lower case?

If I am interpreting things correctly, Punycode is defined on the
Unicode code points themselves, and certainly not on the short
identifier strings used to write those code points down.
So for the Chinese 'banana' example, you'd be encoding two
code point integers (39321 = 0x9999, followed by 34121 = 0x8549),
*not* the sequence of integers for the characters of the ASCII
string "U+9999U+8549".

Of course, somebody might want to try having a hostname or
domain name of "U+9999U+8549", but that is a 12-character ASCII
string, and is not the same thing at all as the two-character
Unicode string for the Chinese word for 'banana'.
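
The difference can be sketched in Python, whose standard library
ships a "punycode" codec. This is only an illustration of the
distinction, assuming that codec follows the Punycode draft; it is
not the IDNA ToASCII operation, which also involves nameprep and a
prefix.

```python
banana = "\u9999\u8549"   # two-character Unicode string
literal = "U+9999U+8549"  # 12-character ASCII string

# What Punycode actually sees: the code point integers.
banana_points = [ord(ch) for ch in banana]    # [39321, 34121]
literal_points = [ord(ch) for ch in literal]  # twelve ASCII values

# Encoding the two strings gives different results, because the
# inputs are different sequences of code points.
banana_encoded = banana.encode("punycode")
literal_encoded = literal.encode("punycode")

# Decoding recovers the original two-character string.
banana_decoded = banana_encoded.decode("punycode")
```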

--Ken

> 
> I hope the draft authors can help to clarify these points of interconnection.
> 
> L.M.Tseng