Re: [idn] I-D ACTION:draft-ietf-idn-cjk-00.txt



Some comments on:

> 	Title		: Han Ideograph (CJK) for Internationalized Domain Names
> 	Author(s)	: J. Seng et al.
> 	Filename	: draft-ietf-idn-cjk-00.txt
> 	Pages		: 9
> 	Date		: 13-Sep-00

SECTION 3, Chinese:

> Hence, almost all Han ideographs are associated with some meaning by
> itself which is very different from most other scripts. This causes some
> confusion that Han folding is a form of lexicon-substitution.

Well, a look-up table mapping ideographs that are also words is
itself a lexicon. The test is: is the table the *same* for
all possible languages using the characters? If not, the table
is a lexicon. I don't know enough about written Chinese to know
whether the written languages used in the different language
areas (Mandarin, Cantonese and half a dozen others) are merely close or
are identical enough to share one table. But Japanese and ancient
Vietnamese would probably need different tables from Chinese; are we
prepared to guarantee that they would *not*, for all possible
languages, past, present and future?

> In domain names, we are particularly interested in is to equivalences
> comparison of the names, and not converting SC-to-TC. Therefore, for
> this purpose, it is possible that equivalency matching be done in the
> TC-to-SC folding prior to comparison, similar to lower-case English
> strings before comparing them, e.g. 'taiwan' SC {U+53F0 U+6E7E} will
> match with TC {U+81FA U+5F4E} or TC {U+53F0 U+5F4E}.

1. Languages other than Chinese may use these characters. There
are probably a few such languages in mainland China which the
PRC government is in no hurry to document. How can you guarantee
that a TC character has not gained some separate meaning there?
How do you know it won't in mainstream Chinese in the future?

2. Why should writers of traditional Chinese lose shades of meaning?
This is surely what is happening if the mapping is many-to-one. It
would be equivalent to removing all the French-derived words from
English just because they almost duplicate Anglo-Saxon ones. That
would make me sad but not melancholy.
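
To make the discussion concrete, here is a rough Python sketch of the
fold-then-compare step the draft describes. The table is hypothetical
and holds only the two pairs from the draft's own 'taiwan' example; the
names TC_TO_SC, fold_tc_to_sc and names_match are mine, not the draft's.
A real table would be far larger and, as argued above, need not be the
same for every language.

```python
# Hypothetical two-entry folding table, built only from the draft's
# 'taiwan' example. It is an assumption, not a real TC-to-SC mapping.
TC_TO_SC = {
    "\u81fa": "\u53f0",  # U+81FA folds to U+53F0
    "\u5f4e": "\u6e7e",  # U+5F4E folds to U+6E7E
}


def fold_tc_to_sc(label: str) -> str:
    """Replace each character that has a table entry; keep the rest."""
    return "".join(TC_TO_SC.get(ch, ch) for ch in label)


def names_match(a: str, b: str) -> bool:
    """Compare two labels after folding, like lower-casing before comparing."""
    return fold_tc_to_sc(a) == fold_tc_to_sc(b)


sc = "\u53f0\u6e7e"      # SC form from the draft
tc_a = "\u81fa\u5f4e"    # first TC form from the draft
tc_b = "\u53f0\u5f4e"    # second TC form from the draft
same = names_match(sc, tc_a) and names_match(sc, tc_b)
```

Note that the fold is many-to-one: after it, U+81FA and U+53F0 can no
longer be told apart, which is exactly the loss raised in point 2.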

SECTION 4, Korean:

This section would be more understandable to users of the Roman script
if you simply said that the Jamo correspond to our letters, there being
both vowels and consonants (but more than we have), and syllables
can be written with their Jamo packed together in a square form called
a Hangul, most of which also have Unicode code points.

Unfortunately, Unicode 2.x made the Hangul the primary form and
Unicode 3.0 compounded the problem by removing the compatibility
decompositions for double-consonants (to the sequence of two
consonants), presumably to make canonicalization more efficient.
Initially I think accepting only the Hangul (and not the Jamo)
should work for modern Korean, as Unicode is supposed to have
code points for all modern syllables. The answer to this may be
in the Unicode 3.0 book - I still have only the 2.0 one here.
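
As an illustration of the Jamo/Hangul relationship, the following
Python sketch uses the standard unicodedata module. The sample syllable
and the helper name is_precomposed_hangul are my own choices, and the
results depend on the Unicode version shipped with the interpreter, so
treat it as a sketch rather than a statement about any particular
Unicode release.

```python
import unicodedata

# Three conjoining Jamo: initial HIEUH, vowel A, final NIEUN.
jamo = "\u1112\u1161\u11ab"
# The corresponding precomposed syllable block.
syllable = "\ud55c"

# Canonical composition packs the Jamo into one syllable code point,
# and canonical decomposition unpacks it again.
composed = unicodedata.normalize("NFC", jamo)
decomposed = unicodedata.normalize("NFD", syllable)

# A double consonant (U+1101) is not split into two single consonants,
# even under compatibility decomposition.
double_kept = unicodedata.normalize("NFKD", "\u1101")


def is_precomposed_hangul(label: str) -> bool:
    """True if every character lies in the syllable block U+AC00..U+D7A3."""
    return all("\uac00" <= ch <= "\ud7a3" for ch in label)
```

A registry that accepts only the precomposed forms could normalize to
NFC first and then apply a check like is_precomposed_hangul.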

SECTION 5, Japanese:

> Katakana is a mirror of hiragana with few more forms

I seem to remember seeing somewhere katakana forms for VA, VI, VE, VO,
though, yes, they are not commonly used. They can be represented in
Unicode using the combining voiced sound mark U+3099 after the
corresponding unvoiced syllable.
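
A small Python sketch of that representation, using the standard
unicodedata module. The dictionary names are mine; the point is only
that base katakana plus U+3099 composes, under NFC, to the single code
points U+30F7 to U+30FA.

```python
import unicodedata

VOICED_MARK = "\u3099"  # combining voiced sound mark

# Unvoiced base katakana for WA, WI, WE, WO.
bases = {"VA": "\u30ef", "VI": "\u30f0", "VE": "\u30f1", "VO": "\u30f2"}

# Base letter followed by the combining mark, then composed with NFC.
composed = {
    name: unicodedata.normalize("NFC", base + VOICED_MARK)
    for name, base in bases.items()
}
```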

> and they are used to integrate foreign words or phrases into
> Japanese, or to emphasize words or phrases even in Japanese, or
> to represent onomatopoeia.

These are similar to the situations in which italics are used in
English. Unicode doesn't provide a way to do italics - unless you
count the nonspacing underscore diacritic - but it does distinguish
the two kinds of kana. Since we are presumably not going to allow
italics, this is an inconsistency which we or registering authorities
should probably correct. At least, though, the two forms look
different, and if a font is available for one it will probably be
available for the other, so such duplicates aren't a serious
*technical* problem.
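
If a registering authority did want to treat the two kana as
duplicates, one possible approach (my assumption, not something the
draft specifies) is to fold katakana onto hiragana before comparing.
The sketch below relies on the fact that katakana U+30A1..U+30F6 sit
exactly 0x60 above the matching hiragana U+3041..U+3096; the function
name fold_kana is hypothetical.

```python
KATAKANA_FIRST = 0x30A1
KATAKANA_LAST = 0x30F6
OFFSET = 0x60  # katakana code point minus matching hiragana code point


def fold_kana(label: str) -> str:
    """Map katakana in U+30A1..U+30F6 to hiragana; leave all else alone."""
    out = []
    for ch in label:
        cp = ord(ch)
        if KATAKANA_FIRST <= cp <= KATAKANA_LAST:
            out.append(chr(cp - OFFSET))
        else:
            out.append(ch)
    return "".join(out)


# The word 'katakana' written in katakana and in hiragana.
in_katakana = "\u30ab\u30bf\u30ab\u30ca"
in_hiragana = "\u304b\u305f\u304b\u306a"
folded_equal = fold_kana(in_katakana) == fold_kana(in_hiragana)
```

Characters outside that range, such as the prolonged sound mark
U+30FC, are left untouched by this sketch.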

> If Japanese uses hiragana and katakana only, then it is fairly obvious
> that written Japanese is going to be very long.

The main problem is that the kanji capture shades of meaning which
are not present in spoken Japanese once you remove inflection,
gestures, facial expressions etc. Allowing only the kana is not an option.

SECTION 4 [bis], Vietnamese:

> While Vietnamese also adopted Chinese ideographs ('chu han') and created
> their own ideographs ('chu nom'), they were now replaced by romanized
> 'quoc ngu' today. Hence, this document does not attempt to address any
> issues with 'chu han' or 'chu nom'.

A department of classics in a Vietnamese university (or even a department
of Vietnamese in a US university) could reasonably expect to use these
characters in its names. One cannot just dismiss an entire language
because it happens to have fewer speakers than one's own or one
doesn't know much about it.

SECTION 7, Mechanism:

> c) Folding by Domain Name registration services for the purposes of
>   preventing confusing allocations CJKV Domain Names which would,
>   if transcoded, be the same

How does this differ from the question of US/UK spellings in English?
We don't expect the software to detect the duplicates "colour" and
"color" any more than we expect it to recognize "seven-up" as a
trademark infringement of "7-Up" or to disallow certain naughty
words (I am told the people who vet ".com" have a list of seven
of those). These things are done by humans; they probably do use
software to help them, but that software is not installed on every
host, its knowledge was given to it by lawyers, not networking
engineers, and it tracks community standards and laws, not RFCs.
Such a system for Chinese would be useful and a fine thing to
base a business on, but I don't believe we should be considering
embedding these things into every host on the net.

---

I think the draft is far too Sino-centric.

As I have argued in several places above, the kind of folding proposed
by the draft depends on language, local laws and customs, and therefore
each zone must develop and enforce its own rules. IMO, the semantic
folding proposed in the draft should be declared out of the scope of this
group. We should concentrate on providing a unique tag, and not try to
guarantee unique meanings for those tags.