
Re: [idn] Re: Agenda Item for next UTC: Normalizing Case Mapping

To: James Seng <jseng@pobox.org.sg>, unicore@unicode.org
Subject: Re: [idn] Re: Agenda Item for next UTC: Normalizing Case Mapping
From: "Martin J. Duerst" <duerst@w3.org>
Date: Thu, 23 Mar 2000 15:58:15 +0900
Cc: idn@ops.ietf.org
Delivery-date: Wed, 22 Mar 2000 23:41:38 -0800
Envelope-to: idn-data@psg.com

There are indeed equivalences among CJK characters
that are similar to case equivalences. And these should
be carefully considered in our effort.

However, to just ask the UTC to define the equivalences,
or to just try to define them on our own, won't work.

It is very important to understand the nature of these
equivalences, not in terms of language/semantics/whatever,
but in terms of their structure.

For case equivalence, the Turkish i/I is largely an
isolated case. For CJK characters, even just within
Chinese (simplified and traditional), there are many
cases where the equivalence is not one-to-one. In particular
when converting from simplified to traditional, it's
easy to get things wrong.
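
To illustrate the structural point, here is a minimal sketch of what a
one-to-many mapping does to a conversion. The table is a tiny hand-written
assumption, not validated data: the U+7535 pair appears later in this
thread, while the U+53D1 entry (one simplified form commonly mapped to
either U+767C or U+9AEE) is added only as an example. The function name
`traditional_candidates` is hypothetical.

```python
# Illustrative table only: simplified character -> possible traditional forms.
S2T = {
    "\u53d1": ["\u767c", "\u9aee"],  # one simplified form, two candidates
    "\u7535": ["\u96fb"],            # a one-to-one pair
}


def traditional_candidates(label):
    """Return every traditional spelling the table allows for a label."""
    results = [""]
    for ch in label:
        # Characters missing from the table are kept unchanged.
        options = S2T.get(ch, [ch])
        results = [prefix + opt for prefix in results for opt in options]
    return results
```

A label made only of one-to-one characters yields a single candidate, but
each one-to-many character multiplies the number of candidates, and the
table alone cannot say which one is right.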

The solution I currently see as most appropriate is
therefore:

- Equivalences, e.g. between all-simplified and all-traditional
   names (on a per-component basis), can be handled by CNAME/DNAME.
   I do not think that mixed usage is very important.

- To solve the problem of somebody else registering
   an equivalent name, it makes sense to create a database that contains
   equivalences in a rather loose sense, and to check new
   registrations against old ones using this data, and resolve
   detected conflicts by hand. Such equivalence data is available
   in various forms, and doesn't have to be validated because
   it has no final say.
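
The registration check in the second point could be sketched roughly as
follows. This is only an illustration under stated assumptions: the four
pairs are the ones from the example name quoted below, and the names
`PAIRS`, `loose_key` and `find_conflicts` are hypothetical. Real
equivalence data would be much larger and need not be one-to-one.

```python
# Hypothetical loose equivalence data: (simplified, traditional) pairs
# taken from the example name in the quoted message.
PAIRS = [
    ("\u7535", "\u96fb"),
    ("\u90ae", "\u90f5"),
    ("\u53f0", "\u81fa"),
    ("\u6e7e", "\u7063"),
]

# Fold every traditional form to its simplified partner. Either direction
# would do, as long as all members of a class map to one representative.
FOLD = {trad: simp for simp, trad in PAIRS}


def loose_key(name):
    """Map a name to a comparison key: lowercase, then fold each character."""
    return "".join(FOLD.get(ch, ch) for ch in name.lower())


def find_conflicts(new_name, registered):
    """Return existing names whose loose key matches that of the new name.

    The result is a list of candidates for manual review, not a verdict.
    """
    key = loose_key(new_name)
    return [old for old in registered if loose_key(old) == key]
```

Because the key is computed per character, mixed simplified/traditional
spellings are flagged as well, without any need to enumerate them as
aliases. The final decision stays with a human.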


Regards,   Martin.

At 00/02/18 16:31 +0000, James Seng wrote:
>Brendan Murray/DUB/Lotus wrote:
> > James Seng wrote:
> > > Would it also be out of range if we consider case folding for Asian
> > > languages?
> > > Simplified-Traditional Chinese. Simplified-Traditional Japanese. or Hiragana,
> > > Katakana single full/half width etc etc...
> >
> > The above are, I believe, beyond the scope of casing: they are, however,
> > admirable suggestions and should be addressed. The width and kana mappings
> > should be pretty much given, although I suspect that the normalization of
> > Han characters may prove to be somewhat more contentious.
>
>Consider the following domain name.
>
>U+7535 U+90AE '.' U+53F0 U+6E7E  (meaning email.taiwan in Chinese)
>
>It can also be represented in the traditional form
>
>U+96FB U+90F5 '.' U+81FA U+7063
>
>To say U+7535 U+90AE '.' U+53F0 U+6E7E != U+96FB U+90F5 '.' U+81FA U+7063 is
>as good as saying email.tw != EMAIL.TW.
>
>But why should UC bother with Chinese 'case' folding? After all, this is a
>problem unique to DNS and we should let it be handled in DNS aliasing via
>DNAME and CNAME, e.g.
>
>U+96FB U+90F5 '.' U+81FA U+7063 IN DNAME U+7535 U+90AE '.' U+53F0 U+6E7E
>
>And let's try repeating it 2^3 times for the different permutations... I guess
>we are lucky since we need to repeat it only 8 times for this name. Perhaps we
>might be even luckier for other, longer names.
>
>I agree this is not 'case folding' as one would normally associate with the
>meaning of 'case'. But the problems are as real as I = dotless i. It is going
>to be very difficult to address this issue, especially with CJK unification
>(what is considered equivalent in Chinese may not be so in Japanese or
>Korean). But it is definitely better to address it in UC than in IDN-WG, IMHO.
>
>-James Seng
