Re: [idn] Re: Unicode is not usable in international context.



> (In contrast, I think I have learned enough by following the past
> months of discussion to tell that the differences between Traditional
> and Simplified Chinese are analogous to spelling differences, and so 
> IDN should not try to unify them.)
> 

Hi, Alan Barrett:

It is good to hear the impression you have formed from the
past months' discussion on the list.  It shows that the
discussion of the TC/SC problem in IDN has exposed the
differences in our understanding of the real-world problem,
and that this list is not dominated by Chinese and Korean
users at all.

As a Chinese user with limited education in Han character
culture, but one who has been interested in the computer
processing of Chinese characters for the past decades, I would
like to offer a different picture for your reference.

The Chinese character set is much larger than the Latin one
(20,000+ vs. 52, not counting ancient symbols), and it
exhibits a much wider range of the symbolic representation
phenomena found in human languages.  These phenomena have
to be classified into different levels before a computational
process can handle them; otherwise there is no way you could
see any characters on a computer screen.

Chinese character processing, which we called "Chinese
information processing" in the 1970s and which goes back to
dictionary lookups centuries ago, has faced this
classification problem since the Qin dynasty, when our
ancestors tried to communicate among different kingdoms.
Now we still face the same problem: how to classify these
characters so that they are handled properly in IDN.

To generalize and call TC/SC analogous to spelling
differences is wrong, because a TC/SC pair is equivalent
in NAMEs as ONE unit of screen display, while spelling
differences in Latin are equivalent in NAMEs across MORE
THAN ONE unit of display, as in "color" vs. "colour".

This difference splits the whole processing of the
computer-human text interface into two different categories,
as we have been discussing on this list.

Following the basic concept of ONE display unit on our
screen, we are discussing extending one unit of display
from 52 ASCII characters to 50,000 UCS characters.

1. Do we need all 50,000 UCS identifiers for IDN? Do we
   need 50,000 x 3 characters, as some Latin users are
   trying to do?

No, we do not need so many identifiers.  That is where the idea
of defining equivalent character sets comes in.  That is the gist
of the "TC/SC must be in IDN" debate.
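
The idea of an equivalent character set can be sketched in a few
lines of Python.  This is only an illustration under stated
assumptions: the table holds four well-known simplified/traditional
pairs, and the names SC_TO_TC, fold_label and same_identifier are
hypothetical, not taken from any IDN draft.

```python
# Illustrative only: a tiny hand-picked equivalence table.
# A real table would be far larger and is not defined here.
SC_TO_TC = {
    "\u56fd": "\u570b",  # 国 -> 國
    "\u6c49": "\u6f22",  # 汉 -> 漢
    "\u5b66": "\u5b78",  # 学 -> 學
    "\u4e1c": "\u6771",  # 东 -> 東
}

def fold_label(label: str) -> str:
    """Replace each character by the representative of its class."""
    return "".join(SC_TO_TC.get(ch, ch) for ch in label)

def same_identifier(a: str, b: str) -> bool:
    """Two labels name the same identifier if they fold to the same string."""
    return fold_label(a) == fold_label(b)

print(same_identifier("\u4e2d\u56fd", "\u4e2d\u570b"))  # 中国 vs 中國 -> True
print(same_identifier("color", "colour"))               # -> False
```

Because each TC/SC pair occupies one display unit, a per-character
table is enough to treat the two forms as one identifier.  No such
table can equate "color" and "colour", since they differ in the
number of display units.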

2. Can we handle more than 50,000 characters? 

Yes.  That is for the localized user interface, one level
above, to deal with.

3. Let us say that by using equivalent character sets we
have brought 50,000 down to 30,000 UCS symbols.  Can we
distinguish the language context of a UCS character to
avoid DNS-level confusion?

a). Some say: we don't care, that is not an IDN problem, that
 is another group's problem.  We only want IDN to pass
 through DNS without glitches.

b). Some say: we care, and the UCS code point gives that
 symbol back.  The only viable solution is to use UTF-8 or
 UTF-16 to retain the original glyph.

Your position may be consistent with a).  I think we differ
in how to divide Chinese characters into different levels for
processing.

People who agree with b) have not taken the problem of
look-alike symbols across language boundaries seriously
enough, quite apart from the other comments already
repeated here.

c). The Chinese group wants TC/SC in IDN, a position rooted
in experience with Chinese character processing and usage
over the last three decades.  I do not deny that some of
them also somewhat like the idea of b).

For internet stability and security maintenance, the only
channel for identities has to be the DNS.  Without a
character's language context, it is impossible to separate
Greek/Cyrillic/Latin/Armenian characters consistently across
the many levels of processing in different user applications;
the best example is CUT & PASTE.  The DNS has to carry
language context information for machines, users and system
administrators alike, transparently in the form of language
tags, not as hidden tags just for machines.
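
The look-alike problem is easy to reproduce.  The following is a
minimal Python sketch, not a proposal: it guesses a character's
script from the first word of its Unicode character name, which is
a rough heuristic, and the helper names script_of and
scripts_in_label are hypothetical.

```python
import unicodedata

def script_of(ch: str) -> str:
    """Rough script guess: first word of the Unicode character name."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]

def scripts_in_label(label: str) -> set:
    """Collect the guessed scripts of all characters in a label."""
    return {script_of(ch) for ch in label}

latin = "paypal"
mixed = "p\u0430yp\u0430l"  # U+0430 is CYRILLIC SMALL LETTER A

print(latin == mixed)                   # False: different code points
print(sorted(scripts_in_label(mixed)))  # ['CYRILLIC', 'LATIN']
```

On screen the two labels can look identical, and a plain cut & paste
carries only the code points, with no visible sign of which language
each character was meant to belong to.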

The language tag has to be an integral part of a name label
to function correctly through all the cut & paste operations.
Sorry, I shall stop here: this discussion is far outside the
group's scope, and my posting rights could be taken away
soon.

Regards,

Liana Ye