
Re: [idn] Internet Draft - Phased Implementation for IDNA



Ben asked:

> >Dear IETF IDN members,
> 
> >The attached draft  "Phased Implementation for IDNA" was submitted.
> >You are welcome to comment on the draft.
> 

> After reading the I-D, my question is: is it possible to modify the
> Appendix B so that it *only* prohibits the "Simplified Chinese code
> points" and allow all the rest to move forward?

And I think the answer is clearly no.

The problem, of course, is that the very term "Simplified Chinese
code point" is not clearly defined. It is certainly not defined by
the Unicode Standard. The proposal is to define it from the source
set tables in 10646, by listing those characters that have only a
G-source. That list encompasses most, but not all, of the formal
list of simplifications mandated by the PRC, and it also includes
scads of traditional forms. The term "S-CDN" is bandied about in
this I-D, and even in the tsconv I-D, as if it were well defined.

But the Chinese writing system is sloshing around in an ancient
ocean of Chinese lexicography and millennia of character usage
change. There is no quick fix engineering solution that gets
the "right answer" for this.

Take just one among many examples of the problem:

U+5F53 dang1 is a common usage character meaning "equal; to work
              as, serve as, be..."

U+7576 dang1 is the traditional variant of the same character.

So, U+5F53 is the simplified form of U+7576. We prohibit U+5F53
by adding it to Appendix B, but allow U+7576, and we are done,
right?

Wrong.

U+5F53 also is onomatopoetic for the sound of a gong or bell.
In that sense it is *also* a simplification of the traditional form
U+5679. (And historically, of course, U+5679 was just an innovation
adding the mouth radical to U+7576, to distinguish the "clang" sense
of dang1 from the "equal" sense of dang1.)

But then we also have U+943A dang1 "clank, clang", which has the
same component, but with the metal radical, instead of the mouth
radical. And *it* has a simplified form U+94DB dang1. Again,
historically, these are the same word as the onomatopoetic U+7576,
but are variants separated out by attachment of a radical.
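
To make the shape of the problem concrete, here is a minimal sketch in
Python. It uses only the dang1 pairs named above; the names PAIRS,
traditional_candidates and ambiguous are mine, invented for
illustration, and the table is nowhere near a complete variant list.
It shows only that the simplified-to-traditional relation is
one-to-many for U+5F53.

```python
# (simplified, traditional) pairs taken only from the dang1 example above.
# This is an illustrative fragment, not a real variant table.
PAIRS = [
    (0x5F53, 0x7576),  # "equal; to serve as" sense
    (0x5F53, 0x5679),  # "clang" sense, mouth radical
    (0x94DB, 0x943A),  # "clank, clang", metal radical
]

def traditional_candidates(pairs):
    """Group the traditional forms under each simplified code point."""
    table = {}
    for simplified, traditional in pairs:
        table.setdefault(simplified, set()).add(traditional)
    return table

def ambiguous(table):
    """Return simplified code points with more than one traditional form."""
    return sorted(cp for cp, forms in table.items() if len(forms) > 1)

TABLE = traditional_candidates(PAIRS)
AMBIGUOUS = ambiguous(TABLE)

for cp in AMBIGUOUS:
    forms = ", ".join(f"U+{t:04X}" for t in sorted(TABLE[cp]))
    print(f"U+{cp:04X} has several traditional forms: {forms}")
```

Any scheme that treats "the traditional form of U+5F53" as a single
code point has to pick one of these and get the other sense wrong.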

But wait, there's more. It turns out that U+5F53 is not a recent,
systematic PRC simplification mandated by the ministry tables
of 1986 published in dictionaries (although of course it does
appear there). It is an older example of a "traditional
simplification" ([U+4fd7][U+5b57] su2zi4) that got carried into
the systematic PRC simplifications. And guess what, U+5F53 is
the *ordinary* character used in Japanese, which by and large
avoids all of the PRC simplifications for historical reasons.
U+5F53 is not only the ordinary character in Japanese -- it is
in the Jyooyookanji, the short list of characters *required* for
Japanese and learned by all Japanese school kids in their
basic education.

So just in this one example, we see both the inherent complexity
of variants of "the same character" in Chinese, and an
illustration of the fact that trying to eliminate the
"SC" from the allowed set of characters for IDN would immediately
cripple the repertoire for use in Japanese.
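
A rough sketch of what that crippling looks like, assuming a
hypothetical prohibited set containing just U+5F53. The set, the
function name label_allowed, and the sample Japanese label (U+672C
U+5F53, the everyday word hontou, "really") are my own illustration,
not anything taken from the I-D.

```python
# Hypothetical Appendix B entry: prohibit the "simplified" U+5F53 only.
PROHIBITED = {0x5F53}

def label_allowed(label, prohibited=PROHIBITED):
    """True if no code point in the label is in the prohibited set."""
    return all(ord(ch) not in prohibited for ch in label)

# Illustrative labels (assumptions for this sketch):
JAPANESE_LABEL = "\u672c\u5f53"     # ordinary Japanese word using U+5F53
TRADITIONAL_LABEL = "\u7576\u6b78"  # dang1gui1 in traditional characters

for name, label in [("Japanese", JAPANESE_LABEL),
                    ("traditional Chinese", TRADITIONAL_LABEL)]:
    verdict = "allowed" if label_allowed(label) else "rejected"
    print(f"{name} label is {verdict}")
```

The filter does what it was asked to do, and in the process rejects a
perfectly ordinary Japanese word.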

I am fully sympathetic with the expressed Chinese requirement that
Chinese expressions for the "same thing", whether represented
in the "simplified" orthography or the "traditional" orthography,
should match transparently to a user. For example:

U+7576 U+6B78 dang1gui1 "Chinese angelica" (traditional characters)

should match

U+5F53 U+5F52 dang1gui1 "Chinese angelica" (simplified characters)

for the <Chinese angelica emporium>.tw website. You could probably
match the whole words: run them through any of a number of
commercially available TC<-->SC transducers and get the match you
want. But if you just try to do code point matching, all hell
breaks loose around the edges, and your scheme is fatally flawed
for Japanese (and Korean) use.
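
Here is a minimal sketch of naive code point matching, to show where
the edges fray. The FOLD table and the functions fold and same_label
are hypothetical, and the table holds only the two pairs from the
dang1gui1 example. Real TC<-->SC converters work on words and context,
which this deliberately does not.

```python
# Hypothetical per-code-point fold: each simplified code point gets
# exactly one traditional target. Only the dang1gui1 pairs are listed.
FOLD = {
    0x5F53: 0x7576,  # dang1: forced to pick one of its traditional forms
    0x5F52: 0x6B78,  # gui1
}

def fold(label):
    """Map each code point through FOLD; unlisted code points pass through."""
    return label.translate(FOLD)

def same_label(a, b):
    """Compare two labels after code point folding."""
    return fold(a) == fold(b)

# The angelica example matches, as intended.
print(same_label("\u7576\u6b78", "\u5f53\u5f52"))

# But U+5679, the other traditional form of U+5F53, does not match it,
# because the fold could only choose one target for U+5F53.
print(same_label("\u5679", "\u5f53"))
```

Adding U+5679 as a second target is not possible in a one-to-one fold,
and folding U+5679 to U+7576 instead would merge two traditional
characters that the lexicographers went to some trouble to separate.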

By the way, U+6B78 gui1 *also* has a traditional simplification
used in Japan (U+5E30), which is not the same as the U+5F52
simplified form used in the PRC. But don't get me started on gui1,
as I might have to start talking about turtles... ;-)

--Ken