
Re: [idn] Re: Unicode is not usable in international context.



> People have posted cases where the number of TC characters that make 
> up a word is different from the number of SC characters that make up 
> the same word.  People have also posted cases where the number of 
> characters remains the same but the mapping depends on context. 

You are talking about combinations of glyphs; in the IDN context
we call them words, or labels.  A label may sometimes contain only
one character, and it is easier for us to discuss this without
having to show the glyphs on our screens here on this list.  This
may be the reason for the confusion above.

Let me try to elaborate a little on Chinese character processing,
that is, one glyph at a time as in the Unicode table.

There are 20,000+ commonly used symbols worldwide, as collected
in Plane 0 of the UCS.  Mainland China uses about 7,000 regularly;
Taiwan uses about 13,000 regularly.  We call these the frequently
used characters.

Among these frequently used characters there are always semantic
differences between any two given characters, due to the history
and locality of their usage, as you can imagine.  This creates
the need to organize the characters, and hence the dictionary
editing and standardization work throughout the history of the
written Chinese language, especially once computers came along.

The first classification is certainly the characters that are
distinct both semantically and in written form.  Many characters
that are semantically distinct but similar in form fall into a
carefully explained category in the education sector as well as
in written-language criticism.  This is analogous to Latin spell
checking as an educational activity, but it is not an
equivalent-symbol-set concern of the kind discussed here on this
list.  It is a spell-checking feature, as far as input is
concerned.

The second classification is the characters whose meanings
overlap but are not the same.  We translate the category for
these characters as synonyms, or "same-meaning characters".  But
they are not a thesaurus; thesauri were introduced into China
only in recent years.  These are not the subject of any
unification or mixing.  The correct usage in a text has to be
differentiated by context, so word dictionaries are used to
help; this is an AI feature in editor software.

The third classification is the characters that are semantically
"identical" but have many different forms, accumulated over the
long history of preserving these characters.  The majority of
the characters in the UCS beyond the above 20,000 frequently
used characters belong to this category, and the majority of
TC/SC pairs also belong to this category.  As I have said, if
you want to find the details of something different, you will
always succeed among these characters.

This third classification is what concerns us for display
preference, and for the possible inclusion of more character
forms beyond the frequently used character set, or their
exclusion from such an equivalence set, an issue that Japanese
users have raised.

Notice that I said the majority of TC/SC pairs belong to this
category.  This is why you have heard that some users do not
agree with this "identical" classification.  It is a fact of
life that the Han user community has to be precise about which
symbol is in which set, so that they have a standard to work
from.  Excluding this equivalence set from the basic [nameprep]
profile would definitely mark the failure of IDN and cause more
"trademark" conflicts down the road.

> People have stated that conversion between TC and SC requires a 
> dictionary of words,  rather  than a table of characters.  All these 
> show that TC/SC is analogous to a spelling difference.

Correct.  This is to deal with the small number (on the scale of
10 vs. 2000) of TC/SC pairs at the input and display level,
which should not overtake the fact that TC/SC have to be
equivalent identifiers in [nameprep].
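
For example, the SC character 发 (U+53D1) corresponds to TC 發
(U+767C) in one word and to TC 髮 (U+9AEE) in another, so a
character table alone cannot pick the right display form.  A
minimal Python sketch, with a hypothetical two-entry word
dictionary, just to show the ambiguity:

    # SC->TC display conversion has to be word-based: the same SC
    # character 发 maps to two different TC characters.  This tiny
    # dictionary is hypothetical, for illustration only.
    WORD_TABLE = {
        "\u53d1\u5c55": "\u767c\u5c55",  # SC 发展 -> TC 發展 (development)
        "\u5934\u53d1": "\u982d\u9aee",  # SC 头发 -> TC 頭髮 (hair)
    }

    def sc_to_tc(word):
        # Convert one SC word for TC display; fall back to the
        # input when the word is not in the dictionary.
        return WORD_TABLE.get(word, word)

    print(sc_to_tc("\u53d1\u5c55"))  # 發展: 发 converted to 發
    print(sc_to_tc("\u5934\u53d1"))  # 頭髮: 发 converted to 髮

This word-level machinery is only needed for display conversion,
not for the identifier matching discussed below.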

As for using characters as identifiers in IDN, the job we have
to be concerned with is reducing these semantically "identical"
characters from whatever number down to a "no trademark
conflict" level of clearance, yielding a viable symbol set that
we can permit in IDN for identifier matching.  In this sense, it
is like the case-insensitive treatment of Latin symbols.  Yes,
we do want uppercase too, but at the identifier level they are
the same!
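
To make the analogy concrete, here is a minimal Python sketch of
such character-level folding.  The three-entry TC->SC table is
hypothetical, a stand-in for the full equivalence set that a
[nameprep] profile would have to define:

    # Identifier matching by folding, analogous to lowercasing
    # Latin.  The TC->SC table is a hypothetical three-entry
    # stand-in for a full [nameprep] equivalence set.
    TC_TO_SC = {
        "\u767c": "\u53d1",  # 發 -> 发
        "\u9ad4": "\u4f53",  # 體 -> 体
        "\u570b": "\u56fd",  # 國 -> 国
    }

    def fold_label(label):
        # Fold a label to its canonical matching form: lowercase
        # the Latin part, map each TC character to its SC form.
        return "".join(TC_TO_SC.get(ch, ch) for ch in label.lower())

    # Both spellings match as identifiers, just as "IBM" matches "ibm".
    assert fold_label("\u4e2d\u570b") == fold_label("\u4e2d\u56fd")  # 中國 == 中国
    assert fold_label("IBM") == fold_label("ibm")

The fold changes nothing about what the user types or sees; it
only defines when two labels count as the same identifier.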

> I don't claim that TC/SC conversion or equivalence is not a problem.

I hope that the above explanation has shown a feasible solution
to this problem, to your satisfaction.

> Neither do I claim that the potential confusion between <GREEK 
> CAPITAL LETTER ALPHA> and <LATIN CAPITAL LETTER A> is 
> not a problem.   

This is a problem for IDN.  This problem is the opposite of
TC/SC equivalence.  Because these symbols are picked up, pasted,
or typed from a mixture of applications and user interfaces, any
one of them can be the bad guy hidden from someone's eye, and
the machines only know about bits.  If you add more encoding
forms, such as UTF-8/16, or input keystroke sequences, the
problem can escalate quickly.
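
To show how invisible this is to the eye but not to the machine,
a short Python sketch:

    import unicodedata

    # Two characters that render identically in most fonts but
    # are entirely different bits to the machine.
    for ch in ["\u0041", "\u0391"]:
        print("U+%04X  %s" % (ord(ch), unicodedata.name(ch)))

    # U+0041  LATIN CAPITAL LETTER A
    # U+0391  GREEK CAPITAL LETTER ALPHA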

The solutions that I can think of at this moment are two:
1. Unification of symbols, as in CJK unification, with an
   equivalence symbol set defined;
2. A transparent language tag, enforcing that each label is
   consistent with its tag throughout the system, including the
   DNS (see the sketch below).
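
A crude Python sketch of the second idea, using the first word
of each character's Unicode name as a stand-in for a real script
or language tag (a real check would use the Unicode Script
property and would have to exempt digits and hyphens):

    import unicodedata

    def label_scripts(label):
        # First word of the Unicode name ('LATIN', 'GREEK',
        # 'CJK', ...) as a crude stand-in for a script tag.
        return set(unicodedata.name(ch).split()[0] for ch in label)

    def is_consistent(label):
        # Accept a label only if every character comes from one
        # script, so a Greek Alpha cannot hide in a Latin label.
        return len(label_scripts(label)) == 1

    print(is_consistent("example"))       # True:  all LATIN
    print(is_consistent("ex\u0391mple"))  # False: LATIN + GREEK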

If we work out the CJK-in-IDN problem, then this will be a piece
of cake at the end of our IDN banquet :-)

> Neither
> do I claim that the potential confusion between English "theatre" 
> and  American "theater" or English "lift" and American "elevator" 
> are not  problems.  But I believe that all these problems are 
> outside the scope  of IDN.

Correct; these problems are outside the scope of IDN.

Regards,

Liana Ye