
Re: [idn] I-D ACTION:draft-ietf-idn-jpchar-00.txt



Some comments on draft-ietf-idn-jpchar-00.txt:

> Despite of IDN WG rough consensus that character set in multilingual
> domain name is UCS [UCS], most popular Japanese character set used in
> Japan is Japanese Industrial Standards X 0208 -- hereafter abbreviated
> as "JIS" -- [JISX0208].  This means that many of PCs and most of PDAs
> including handy phones in Japan can display only JIS and ASCII.
> Therefore, Japanese characters used in multilingual domain name are
> strongly recommended as common part of JIS, ASCII and UCS.

I think you meant "common part of <JIS X 0208 + US-ASCII> and UCS". UCS
includes all of US-ASCII, JIS X 0208 and JIS X 0212.

Taking this draft together with the other one posted about the same
time by the same authors, it seems to me that they have in mind a
simple encoding of JIS X 0208 and US-ASCII, distinguished by a unique
RACE-like prefix. I strongly disagree with this approach. There
are thousands of languages in the world and dozens of scripts,
and a separate encoding for each would create a digital Babel with
humongous interoperability problems.

So that a hand-held device does not have to include mapping tables,
it should operate in the same encoding as IDN. As such items are
produced for world markets, this suggests using UCS rather than
national encodings like JIS. It is true that with UCS the mapping
from code to glyph is sparse, but the ROM required for this is still
small compared to that needed for the glyphs themselves. There is
also a devious packing scheme on the Unicode CD-ROM if you really
want to save memory. But maybe in practice the same phones will be
made for China too, and then the problem goes away.

Asking network software to behave one way below one domain (.jp) and
a different way below another (.cn) is not necessary. One can
use UCS and take care to register only names under .jp which are
in either US-ASCII or JIS X 0208. In the long term, even that
will become technically unnecessary, though it may still be
desirable for other reasons to restrict the characters used.
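
To make the registry-side restriction concrete, here is a rough sketch
of the kind of check a registration program could run. It is only an
illustration: the function name is made up, and it uses Python's
"iso2022_jp" codec as a stand-in for "US-ASCII or JIS X 0208". That
codec's repertoire is close to, but not exactly, those two sets (it
also knows JIS X 0201 Roman), so a real registry would want its own
table.

```python
def in_ascii_or_jis_x0208(label: str) -> bool:
    """Approximate test: can every character be encoded as ISO-2022-JP?

    Assumption: the "iso2022_jp" codec is used as a convenient proxy for
    the union of US-ASCII and JIS X 0208. It is not an exact match.
    """
    try:
        label.encode("iso2022_jp")
    except UnicodeEncodeError:
        return False
    return True


if __name__ == "__main__":
    for candidate in ["example", "日本語", "한국"]:
        print(candidate, in_ascii_or_jis_x0208(candidate))
```

The point is that the name travels as UCS everywhere; only the
registration step consults the national repertoire, and it can be
changed without touching resolvers.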

> JIS has a lot of characters including graphical and compatible
> characters.  But as for domain name, significant characters to
> represent names are Kanji, Hiragana and Katakana [CJK].  Therefore,
> according to the principle, Japanese characters in multilingual domain
> name MUST be Kanji, Hiragana and Katakana in JIS.

This is a policy decision that can be made by the owners of .jp (for
whom I assume the draft's authors work). But in the UCS schemes
discussed here, romaji could be allowed in names too. When you register
a name, you probably run some program which goes off and checks the
name for all kinds of things - rude words etc. These restrictions should
not be at the network level because that makes them hard to change.
As another example, if you decide that initially no X 0212 characters
should be allowed because clients don't support them, you can choose when
to relax that restriction without having to alter networking software
on every node. And you can do so immediately, or ad hoc for certain
names.

> 2. Canonicalization rules of Japanese characters in multilingual
>    domain name labels
> 
> In this section, this document describes two parts of canonicalization
> rules.  One explains "localization", and the other comments on
> "internationalization".  In other words, one is for Input/Display
> level, and another is for API level [IDNA].
>
> 2.1 Localization: Characters to be canonicalized before NAMEPREP

I proposed here about 3 months ago that UCS compatibility characters,
which include the full-width and half-width variants you mention
here, be either folded or disallowed. I thought the rough consensus
then was for folding, but the NAMEPREP draft has not been updated to
reflect that. I went on to argue that the characters should be folded
at application level and disallowed at protocol/resolver level for
robustness, but I don't think that had wide support. I agree that
most current Japanese users will expect to be able to type in the
variants, so the folding must be done somewhere.

The folding doesn't need to be *localized*; it can be done regardless
of the language. In general, the width variants simply won't occur
for languages other than Japanese, but if they did they should also
be folded.
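
As a sketch of what language-independent width folding could look
like, the following uses the decomposition data in Python's
unicodedata module and folds only characters whose compatibility
decomposition is tagged <wide> or <narrow>. The function name is
hypothetical, and this is not the table from the draft, just the same
idea driven by the Unicode data.

```python
import unicodedata


def fold_width(text: str) -> str:
    """Replace full-width and half-width variants by their ordinary forms.

    Only decompositions tagged <wide> or <narrow> are applied; all other
    characters are passed through unchanged.
    """
    out = []
    for ch in text:
        decomp = unicodedata.decomposition(ch)
        if decomp.startswith(("<wide>", "<narrow>")):
            # Format is "<tag> XXXX [XXXX ...]"; skip the tag, keep the code points.
            out.append("".join(chr(int(cp, 16)) for cp in decomp.split()[1:]))
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(fold_width("ＡＢＣ１２３"))  # full-width alphanumerics
    print(fold_width("ｶﾀｶﾅ"))          # half-width katakana
```

Nothing in it looks at the language of the name or of the user
interface; a full-width Latin letter is folded wherever it turns up.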

> The file "idntabjpcanon10.txt" defines compatible characters, with
> additional canonicalized character code as 3rd field; that is, mapping
> table of FULL-WIDTH Alpha-numeric to ASCII, and HALF-WIDTH kana to
> Katakana.

I notice that you don't map the z-variants. I presume this is
under discussion by the nameprep design group.

> Recommended order of applying canonicalization rules is as follows:
>
>        (1) "idntabjpcanon10"
>        (2) "idntabjpcom10" [presumably "idntabjpcomp10" was meant]

These correspond to mapping different spellings of Unicode characters
(normalization form C) and to folding Unicode compatibility
characters. Together they correspond to normalization form KC,
and the order doesn't matter as the tables from the consortium are
consistent (at least I checked them for 2.1 and that is the clear
intention of the design).
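
For anyone who wants to try this, here is a small sketch using
Python's unicodedata module (which ships a much later version of the
Unicode tables than the one I checked, so treat it as an illustration
only). It compares normalization form KC with the two-step route of
compatibility decomposition followed by canonical composition, on a
few sample strings of my own choosing.

```python
import unicodedata


def nfkc_two_step(text: str) -> str:
    """Compatibility decomposition (NFKD) followed by canonical composition (NFC)."""
    return unicodedata.normalize("NFC", unicodedata.normalize("NFKD", text))


# Sample inputs: full-width Latin, half-width KA plus half-width voiced
# sound mark, a decomposed e-acute, and plain Kanji.
samples = ["\uff21\uff22\uff23", "\uff76\uff9e", "e\u0301", "日本語"]

if __name__ == "__main__":
    for s in samples:
        one_step = unicodedata.normalize("NFKC", s)
        print(ascii(s), "->", ascii(one_step), one_step == nfkc_two_step(s))
```

The half-width case is the interesting one: the width folding yields
KA followed by a combining voiced sound mark, and the composition
step then joins them into the single character GA.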

> Furthermore, for historical reasons, JIS have many compatible code
> points in Kana and Alpha-numericals.  Such compatible code points are
> still used widely, so that these characters SHOULD be acceptable
> especially in user interface, and MUST be canonicalized before
> transmission to the wire. The former half should be implemented for
> localization, and the latter half must be implemented for
> internationalization.

I don't think we want to make a distinction between localization 
and internationalization. Suppose I am using an English-language 
application (for the sake of argument, a browser in an Internet 
cafe in Nepal), speak Japanese, and want to access a site in 
Japan. I don't want the domain names to be treated differently 
even though my menus are in English. As another example, 
currently email messages don't even have a language tag, and the 
n million copies of sendmail aren't all going to be changed to 
look for one.