Re: [idn] Re: Agenda Item for next UTC: Normalizing CaseMapping



Paul Hoffman / IMC wrote:
> Could you explain why? I'm looking at the code charts, and U+7535 looks
> nothing like U+96FB; 

U+7535 and U+96FB are related to each other in Chinese as what we call
Traditional-Simplified ideograms.

Chinese characters are based on 'drawing'. In the older days, when one
referred to "lightning", one used U+96FB. As lightning only happens when it is
raining, U+96FB ("lightning") has U+96E8 ("rain") on top.

But as U+96FB ("lightning") is very complicated to remember and write (well,
imagine teaching a 7-year-old the basic 4000 ideograms needed to read/write
simple Chinese), a simplified version was defined by dropping the U+96E8
("rain") from U+96FB, giving us U+7535. In fact, many of the other glyphs
built on the "rain" (U+96E8) glyph were also simplified; for example, "cloud"
was simplified from U+96F2 to U+4E91. Of course, many other ideograms were
also simplified, e.g. U+4E9E to U+4E9A.

However, to the Chinese both U+96FB and U+7535 refer to "lightning"; both are
technically the same character and we continue to use both forms. In China
and Singapore, we would use simplified "lightning", i.e. U+7535, and in Taiwan
and Hong Kong, we would use traditional "lightning", i.e. U+96FB.
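
To make the point concrete, here is a minimal Python sketch. It assumes only
the three pairs named above; the table TRAD_TO_SIMP and the helper
related_by_unicode are names made up for this illustration, not part of any
standard. It checks whether case folding plus NFKC normalization alone would
relate each pair, and it should report that they do not, which is why an
explicit Traditional-Simplified table is needed.

```python
import unicodedata

# Traditional -> Simplified pairs taken from this message.
# Assumption: an illustrative table only, nowhere near a complete mapping.
TRAD_TO_SIMP = {
    "\u96fb": "\u7535",  # "lightning"
    "\u96f2": "\u4e91",  # "cloud"
    "\u4e9e": "\u4e9a",
}


def related_by_unicode(a: str, b: str) -> bool:
    """True if case folding plus NFKC normalization makes a and b equal."""
    def fold(s: str) -> str:
        return unicodedata.normalize("NFKC", s.casefold())
    return fold(a) == fold(b)


if __name__ == "__main__":
    # Case is handled by the built-in folding...
    print("A / a:", related_by_unicode("A", "a"))
    # ...but the Traditional-Simplified pairs are left unrelated.
    for trad, simp in TRAD_TO_SIMP.items():
        print(f"U+{ord(trad):04X} / U+{ord(simp):04X}:",
              related_by_unicode(trad, simp))
```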

Let's not get into simplified "simplified ideograms", which are popular in
China nowadays...

> U+90AE looks only somewhat like U+90F5 (but is
> completely distinguishable). If you are saying "people might be as likely
> to write U+7535 U+90AE as they are U+96FB U+90F5 for the same word", that's
> similar to saying "people might write Duerst for Dürst". What they see on a
> display (paper or computer) is not confusing here.

If looks are the way to determine "similar", then "A" and "a" look grossly
different to someone who doesn't know English. I mean, how can a "pyramid with
a line across" be the same as a "circle with a line down"????

> >U+96FB U+90F5 '.' U+81FA U+7063 IN DNAME U+7535 U+90AE '.' U+53F0 U+6E7E
> 
> That is a local decision by the administrator for the domain in question (I
> can't tell which one you are saying is the base here).

It is not a local decision. It is critical.

If U+96FB U+90F5 '.' U+81FA U+7063 belongs to one company and U+7535 U+90AE
'.' U+53F0 U+6E7E belongs to another, there would be horrible confusion. The
consequences are similar to those of email.tw and EMAIL.TW belonging to
different companies.
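
As a rough sketch of what I mean (Python; the function fold_name and its
table are hypothetical and cover only the four ideograms in this example, and
this is not a proposal for how the folding should actually be specified): if
names were folded Traditional-to-Simplified the same way ASCII names are
case-folded, the two names would compare equal and could not be registered to
two different companies.

```python
# Hypothetical folding table, built only from the code points in the
# example above. A real table would be far larger and is not defined here.
TRAD_TO_SIMP = {
    "\u96fb": "\u7535",
    "\u90f5": "\u90ae",
    "\u81fa": "\u53f0",
    "\u7063": "\u6e7e",
}
FOLD = str.maketrans(TRAD_TO_SIMP)


def fold_name(name: str) -> str:
    """Case-fold the name, then map Traditional ideograms to Simplified."""
    return name.casefold().translate(FOLD)


trad = "\u96fb\u90f5.\u81fa\u7063"
simp = "\u7535\u90ae.\u53f0\u6e7e"

if __name__ == "__main__":
    # Raw comparison treats them as two different names.
    print(trad == simp)
    # After folding they collapse to one name, like email.tw and EMAIL.TW.
    print(fold_name(trad) == fold_name(simp))
    print(fold_name("email.tw") == fold_name("EMAIL.TW"))
```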

> The example you give here sounds like a problem of language synonyms, not
> one of script properties, which is what case is. Have I misinterpreted the
> example you gave?

Yes you have.

Now, to understand how stupid CJK unification is, you need to trace back how
CJK came to have similar ideograms in the first place.

Ideograms travelled from China to Korea, then from Korea to Japan, about 3-4
times in the last 2000 years or so. Therefore, you can actually see snapshots
of Chinese ideogram development by studying Japanese kanji from various
times. The Japanese kanji set also ended up smaller than the Chinese one *BUT*
it is not a subset, as the Japanese also defined their own ideograms.

This leads to Chinese and Japanese sometimes having different ideograms for
the same thing. For example, the Japanese ideogram for Buddha ("butsu"),
U+4ECF, is different from the Chinese one ("fo2"), U+4F5B. In fact, U+4ECF
only exists in Japanese.

It also means Japanese kanji may contain ideograms which are the traditional
form in Chinese but which do not map to the simplified form, because the
simplified form is unknown in Japanese. For example, book ("sho") in Japanese
is U+66F8, which is the same ideogram as book ("shu1") in Chinese traditional
form. But the Chinese simplified form for book, U+4E66, is unknown in
Japanese. Even if the simplified form exists, it may take on a different
meaning (for an example, see below).

As if things were not interesting enough, some ideograms were slightly
modified during the transfer. For example, the traditional form of my last
name "zhuang" (meaning "villa") is U+838A. In Japanese, it is "sou" (meaning
"villa") but it is a slightly different ideogram, U+8358. So it remains a
question/debate whether U+838A = U+8358.

Now, the simplified form of U+838A ("villa") in Chinese is U+5E84. U+5E84 also
exists in Japanese but it takes on a different meaning, "level". So if you say
U+8358 = U+838A, and you simplify it to U+5E84, then everything goes crazy...
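
A small Python sketch of that chain, under the assumption that both rules
were adopted (the table names and the fold helper are invented for this
illustration): treat the Japanese U+8358 as equal to the traditional U+838A,
then simplify U+838A to U+5E84. The Japanese "villa" then lands on the same
code point as the Japanese U+5E84, which carries a different meaning.

```python
# Assumption: the two candidate rules discussed above, and nothing else.
JP_VARIANT_TO_TRAD = {"\u8358": "\u838a"}  # the debated equivalence
TRAD_TO_SIMP = {"\u838a": "\u5e84"}        # Chinese simplification


def fold(ch: str, *tables: dict) -> str:
    """Apply each mapping table in order; unmapped characters pass through."""
    for table in tables:
        ch = table.get(ch, ch)
    return ch


if __name__ == "__main__":
    jp_villa = fold("\u8358", JP_VARIANT_TO_TRAD, TRAD_TO_SIMP)
    jp_other = fold("\u5e84", JP_VARIANT_TO_TRAD, TRAD_TO_SIMP)
    print(f"U+8358 folds to U+{ord(jp_villa):04X}")
    print(f"U+5E84 folds to U+{ord(jp_other):04X}")
    # Two Japanese ideograms with different meanings now compare equal.
    print(jp_villa == jp_other)
```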

How fun :-)

-James Seng

ps: I am writing this in my hotel room without my Unicode book, armed only
with my CJK IME and my UTF5 converter (which allows me to get the code point
in Unicode by reading it directly), so pardon me if I have made any mistakes.