
Re: [idn] Chinese Domain Name Consortium (CDNC) Declaration



L.M. Tseng wrote:

> From owner-idn@ops.ietf.org Tue Feb  5 02:36:26 2002
> To: "Erin Chen" <erin@twnic.net.tw>, "Dave Crocker" <dhc@dcrocker.net>
> Cc: "IESG" <iesg@ietf.org>, "IAB" <iab@isi.edu>,
>         "IETF IDN WG" <idn@ops.ietf.org>
> Subject: Re: [idn] Chinese Domain Name Consortium (CDNC) Declaration

> Dear Dave Crocker:
> A friend gave me an example involving CJK Unicode. It is ambiguous
> to me which of these are correct Chinese characters and which are
> not. In our handwriting, both forms of each pair are used and mixed.
> 
> 淸眞敎 U+6DF8 U+771E U+654E
> 淸眞教 U+6DF8 U+771E U+6559
> 淸真敎 U+6DF8 U+771F U+654E
> 淸真教 U+6DF8 U+771F U+6559
> 清眞敎 U+6E05 U+771E U+654E
> 清眞教 U+6E05 U+771E U+6559
> 清真敎 U+6E05 U+771F U+654E
> 清真教 U+6E05 U+771F U+6559

Huh? How is this contributing to closure on Last Call on
the IDNA documents? And why is it cc'd to IESG and IAB?

For those who may be mystified, this is the Chinese word for
"Islam", qing1zhen1jiao4.

The ordinary way this would appear in a PRC dictionary is:

   U+6E05 U+771F U+6559

and not any of the other 7 permutations.

In a more traditional dictionary as might be seen in Taiwan
or Hong Kong, it might be printed:

   U+6DF8 U+771E U+6559

and not any of the other 7 permutations.

However, if you were using a Big-5 computer in Taiwan,
you would use the same characters as for the PRC for
this:

   U+6E05 U+771F U+6559

and not any of the other 7 permutations (though, in any
case, fonts might vary in which glyph they show).
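
To make the count concrete: the 8 strings in the quoted list are just
the product of three two-way variant choices. A minimal sketch in
Python, assuming nothing beyond the code points quoted above (the
names VARIANTS and spellings are mine, for illustration only):

```python
from itertools import product

# One (variant, common form) pair per syllable, taken from the quoted list.
VARIANTS = [
    ("\u6df8", "\u6e05"),  # qing1
    ("\u771e", "\u771f"),  # zhen1
    ("\u654e", "\u6559"),  # jiao4
]

def spellings():
    """Return every combination of one form per syllable, as strings."""
    return ["".join(combo) for combo in product(*VARIANTS)]

if __name__ == "__main__":
    for s in spellings():
        # Print each spelling with its code points, as in the quoted list.
        print(s, " ".join(f"U+{ord(c):04X}" for c in s))
```

Only a few of these 8 are spellings anyone would expect to see in
print, which is the point of the dictionary examples above.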

U+6E05 and U+771F, by the way, are examples of "traditional
simplifications" reflecting handwritten forms, that
predate the PRC systematic simplifications. The same two
forms are also used in Japan.

U+654E is another handwriting alternative for U+6559, but
it is seldom seen in printed material. U+6559 is used in
the PRC, Taiwan, and in Japan alike.

All 6 characters have G, T, and K sources in 10646, and
4 of them have J sources as well. So for this kind of
overlap of forms, any suggestion to delete G-source-only
characters from the allowed set would do nothing at all
to remove the ambiguity.

And lest this example be taken at face value
as indicating a problem in "CJK UNICODE", it should be noted
that these alternate forms of the "same character" are present
in Unicode because the same distinctions are made in
legacy CJK character encodings in Asia. In particular,
note the following mappings:

For "GBK", Code Page 936 Simplified Chinese:

0x9C5B	0x6DF8	#CJK UNIFIED IDEOGRAPH
0xC7E5	0x6E05	#CJK UNIFIED IDEOGRAPH
0xB177	0x771E	#CJK UNIFIED IDEOGRAPH
0xD5E6	0x771F	#CJK UNIFIED IDEOGRAPH
0x949C	0x654E	#CJK UNIFIED IDEOGRAPH
0xBDCC	0x6559	#CJK UNIFIED IDEOGRAPH

And for "Shift-JIS", Code Page 932 Japanese:

0xEDE4	0x6DF8	#CJK UNIFIED IDEOGRAPH
0xFB43	0x6DF8	#CJK UNIFIED IDEOGRAPH
0x90B4	0x6E05	#CJK UNIFIED IDEOGRAPH
0xE1C1	0x771E	#CJK UNIFIED IDEOGRAPH
0x905E	0x771F	#CJK UNIFIED IDEOGRAPH
0xEDB1	0x654E	#CJK UNIFIED IDEOGRAPH
0xFACD	0x654E	#CJK UNIFIED IDEOGRAPH
0x8BB3	0x6559	#CJK UNIFIED IDEOGRAPH

So if you are working on a Windows system in either of
these legacy code pages, in China or Japan, you
already have the same options for representational
ambiguity, without invoking Unicode at all.
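
Anyone who wants to check this without the mapping tables at hand
can ask a codec directly. A rough sketch in Python, assuming its
built-in "gbk" and "cp932" codecs are acceptable stand-ins for Code
Pages 936 and 932 (the names CHARS and legacy_bytes are mine, not
from any standard). Where the table above lists two byte sequences
for one character, the codec will report only the one it prefers
when encoding:

```python
# The six code points discussed in this thread.
CHARS = ["\u6df8", "\u6e05", "\u771e", "\u771f", "\u654e", "\u6559"]

def legacy_bytes(codec):
    """Map each character to its byte sequence in the given legacy codec.

    Raises UnicodeEncodeError if the codec cannot represent a character.
    """
    return {ch: ch.encode(codec) for ch in CHARS}

if __name__ == "__main__":
    for codec in ("gbk", "cp932"):
        for ch, raw in legacy_bytes(codec).items():
            print(f"{codec:6} U+{ord(ch):04X} 0x{raw.hex().upper()}")
```

If all six encode without error and to six different byte sequences,
the legacy code page is making the same distinctions Unicode makes.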

--Ken

> 
> L.M.Tseng
>