
Re: [idn] Re: Unicode is not usable in international context.



Liana's response to Alan Barrett's questions wandered off into
advocacy of language identity coding in name labels, so let
me take another crack at the specific questions here.

> On Thu, 21 Mar 2002, Masataka Ohta wrote:
> > Unicode is not usable in international context. [...] Unicode is
> > usable in some local context. [...] However, the context information
> > must be supplied out of band.
> 
> Let me see if I can understand this argument about Unicode and local
> context.  I am an English speaker who can't tell the difference between
> the Chinese character that appears as the second character of the
> Chinese word for the city that I call "Beijing", and the Japanese
> character that appears as the second character of the Japanese word
> for the city that I call "Tokyo".  I believe that (as used in the city
> names) both characters mean something like the English word "capital".

Correct. The character you are talking about is U+4EAC in Unicode.
It was borrowed into Japan from China roughly during the Tang
dynasty (along with thousands of others, as Buddhist monks
and other travelers brought the Chinese writing system to Japan).
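
As a concrete footnote to that (a minimal Python 3 sketch of my own,
not anything from the IDN drafts): however the text arrives, the
character in question is one and the same code point.

  capital = "\u4eac"                 # the "capital" character at issue
  print(hex(ord(capital)))           # 0x4eac
  print(capital.encode("utf-8"))     # b'\xe4\xba\xac'
  print(capital.encode("shift_jis")) # b'\x8b\x9e', cf. the chart discussion below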

> Say there's a Chinese character that looks (to uneducated western eyes)
> like a box with three legs and a hat, and a Japanese character that
> looks (to uneducated western eyes) like a box with three legs and a hat.
> Say the Chinese character looks slightly different from the Japanese
> character, but a Chinese person can easily recognise the Japanese
> character and understand its meaning in context, and a Japanese person
> can easily recognise the Chinese character and understand its meaning in
> context.

This is definitely and absolutely the case for this character. It
is a very commonly used character (after all, it appears in the names
of the capital cities of both countries), and is recognized by all
literate speakers of either Japanese or Chinese.

> As far as I understand, Unicode would say that these are not two
> different characters, but just different display forms of the same
> unified character (or whatever the correct technical terms are).

Two glyphs used to display the same character, yes.

> Display software would have to have out of band knowledge to help it
> choose between the Chinese and Japanese display forms.

Partly correct. As a character by itself, this would be true.
However, it is also possible to apply heuristics on rather short
strings of Japanese or Chinese characters to determine which
language is involved. The heuristics would, however, tend to
fail on short name labels, as in URLs that might consist only
of Han characters (= kanji in Japanese).
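
To make the heuristic idea concrete, here is a rough Python 3 sketch
(purely my own illustration): the presence of kana is a strong hint
of Japanese, but a label consisting only of Han characters gives a
heuristic like this nothing to go on.

  def guess_language(text):
      """Very rough hint: kana implies Japanese; Han-only is ambiguous."""
      has_kana = any('\u3040' <= ch <= '\u30ff' for ch in text)  # Hiragana/Katakana
      has_han  = any('\u4e00' <= ch <= '\u9fff' for ch in text)  # CJK Unified Ideographs
      if has_kana:
          return "ja"
      if has_han:
          return "zh-or-ja"       # the characters alone cannot tell you which
      return "unknown"

  print(guess_language("\u6771\u4eac\u90fd"))   # "Tokyo-to", all Han: ambiguous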

But as others have pointed out, even determination of which *language*
is involved doesn't determine what presentation is preferable to
an end user. The general preference in Japan is to use Japanese-specific
fonts, *even for the display of Chinese*. If you look at Japanese
newspapers, they don't switch over to Chinese style fonts when
citing the names of Chinese officials or Chinese placenames, for
example. The issue is one of locale-based determination of typographical
preferences in use, rather than language-based determination of which
glyph to pick.

> As far as I understand, absence of out of band knowledge could lead to
> the hypothetical Unicode character <CJK character that looks a bit like
> a box with three legs and a hat> being displayed as if it were <Chinese
> character that looks a bit like a box with three legs and a hat>, even
> if the author's intent was to display <Japanese character that looks a
> bit like a box with three legs and a hat>.

It is true that if you guess wrong, or don't have appropriate context,
you can end up picking a Chinese style font for display when a user
would have preferred a Japanese style font, or vice versa. The usual
fix for that is to let the user set their preferences and then for
the software to honor them rather than just guessing.

But even then, this is not usually a problem. The average Watanabe Jiroo
will usually have just Japanese fonts installed on his machine -- that
is what he prefers and uses. The average Li Jianping will usually have
just Chinese fonts installed on his machine -- that is what he prefers
and uses. The question will only come up for people who have set up
multilingual options on their machines, with both Japanese-style and
Chinese-style fonts installed.
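
For that multilingual case, here is a sketch of the "set preferences
and honor them" fix mentioned above (Python 3; the function and font
names are hypothetical, and it reuses the guess_language() sketch
from earlier):

  # Hypothetical font names; only the shape of the logic matters.
  FONTS = {"ja": "MS Mincho", "zh": "SimSun"}

  def pick_han_font(text, user_pref=None):
      if user_pref in FONTS:            # honor an explicit user preference first
          return FONTS[user_pref]
      guess = guess_language(text)      # fall back to the heuristic sketch above
      return FONTS.get(guess, FONTS["ja"])   # some default has to be chosen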

And one other issue: the terms "CJK character", "Chinese character",
and "Japanese character" are really all just different English
translations of the *same* words (kanji in Japanese, han4zi in Mandarin
Chinese), so in part the misunderstandings that arise about this
topic result from mistaken presuppositions about the validity of
the distinctions between the English translations of the terms.

> As far as I understand, Masataka Ohta considers this to be a fatal flaw
> in Unicode.  I hope he will correct me if I have misunderstood his
> objection.

I can't speak for Ohta-san about his perception of the fatality of
the "flaw" here, but he seems to be claiming, basically, that Unicode
plain text doesn't provide an unambiguous indication of the context
for choosing appropriate display, whereas ISO 2022-style code switching
gives you in-band context: you are explicitly switching between,
say, EUC-JP and EUC-CN, so you are picking between two distinct
encodings associated with different countries (and languages).

However, contrary to Ohta-san's implicit claims, having 2022 contexts
doesn't solve your problem any better than before. Suppose I have
an ISO 2022 stream of the following sort (schematically):

  ... <ESC to EUC-JP> tookyoo to <ESC to EUC-CN> beijing ...

(for "Tokyo and Beijing", where I deliberately switch to EUC-CN and
use the byte values from EUC-CN to encode the Chinese city name).
Now how do I display that to a user? If I just blindly follow the
2022 context clues and switch to a Chinese font every time I encounter
an EUC-CN stretch of text, and then display that to a Japanese end user,
I will get exactly the opposite of what I claimed above to be Japanese
typographical usage and end-user preference, which is to display
everything, including the Chinese names, with Japanese fonts.
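
To put the same point in code (a Python 3 sketch of my own, using
EUC-JP and GB2312 as stand-ins for the two 2022 contexts): once each
stretch is decoded with its own codec, both contain the very same
code point for "capital", and the renderer faces exactly the same
style decision as before.

  tokyo   = "\u6771\u4eac"             # Tokyo
  beijing = "\u5317\u4eac"             # Beijing

  jp_bytes = tokyo.encode("euc_jp")    # bytes from the "EUC-JP" stretch
  cn_bytes = beijing.encode("gb2312")  # bytes from the "EUC-CN" stretch

  # Different byte sequences, same decoded character at the end of each:
  assert jp_bytes.decode("euc_jp")[-1] == cn_bytes.decode("gb2312")[-1] == "\u4eac"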

So all this discussion about Unicode deficiencies and 2022 context
superiority is just smoke and mirrors. Use of 2022 code-switching
contexts doesn't help with the *real* problem, which is to determine
end-user preferences for styles.

> 
> I don't know enough to tell whether the difference between corresponding
> Chinese and Japanese characters is analogous to a font difference or
> a spelling difference, but the "ignorant westerner can't tell the
> difference" test biases me towards the "font difference" side. 

In this particular case, we are talking about a style difference for
the *same* character. This is not even slightly in doubt.

For an exhibit of the issue, see The Unicode Standard, Version 3.0,
p. 631 for U+4EAC. The main code charts are printed with a commercial
*Chinese* font, so you can see Chinese stylistic details in it.
Then you can compare that with the same character printed on p. 927
of the standard, where you see the Shift-JIS index to Han characters,
printed with a commercial *Japanese* font, so you can see Japanese
stylistic details in it. U+4EAC is located on that page at 0x8B9E,
its encoding in Shift-JIS.

The main systematic difference between the glyphs in these two
instances is in the very top stroke on the "hat". The Chinese
style shows a "dian" stroke, a kind of teardrop shape that
results from the calligraphic tradition of making a "dot" by
lightly laying the brush down with the tip at the upper left, and
then pressing down a bit to create the heavier rounded part of
the stroke. The Japanese-style glyph instead shows a short vertical
stroke, which results from the calligraphic tradition of making a
vertical stroke by starting with a dot, then lifting the brush
a bit and drawing it a short distance straight down.

You can see that this is simply a systematic stylistic choice made
in this Japanese font (and similarly in other Japanese fonts which
follow this typographic tradition) by examining the hats on
other characters on the same page: look at 0x8B9C, 0x8B9D, and also
look at analogous top strokes on different components: 0x8B71, 0x8BF3,
for example.

There are other stylistic differences that can be noted between
the commercial Chinese and Japanese fonts, but these are down
at the level of detail one would be concerned with in comparing,
for example, an Arial font with a Helvetica font for Latin
characters.

> If they
> are analogous to spelling differences, then I would say that unifying
> the different characters was probably an error in Unicode, 

Not an error.

> but that IDN
> should not try to undo that unification.  Either way, I think that IDN
> should document the potential problem but not try to fix it.

Not a problem to be documented or fixed.

> (In contrast, I think I have learned enough by following the past <n>
> months of discussion to tell that the differences between Traditional
> and Simplified Chinese are analogous to spelling differences, and so IDN
> should not try to unify them.)

They are systematic differences, yes, and they should not be, and
are not, unified in the character encoding. Comparing them to spelling
differences in English gets the quibblers a-quibbling, however, so it
is probably not the best analogy to use.

Incidentally, to put the whole Chinese-style versus Japanese-style
issue in context, the amount of variation noted across the
various Arabic typographic traditions easily exceeds the kind of
variation we are talking about in the East Asian typographic traditions
for the presentation of Han characters. But nobody is rabble-rousing
that Unicode is fatally flawed because it encoded only a single
Arabic character repertoire to meet the needs of all the Arabic
styles and all the languages that are represented with it, from
Moroccan Arabic to Yemeni Arabic, to Persian, to Urdu, to Uighur
in China, to Malay. In fact, eminent participants
in Arabic font foundries are telling us that Unicode is precisely
the correct technological foundation on which to build sophisticated
Arabic rendering and display support for all those communities.

Or if the "ignorant westerner[s] who can't tell the difference"
would like an example closer to home, consider the variant
forms of Roman and Fraktur "A"'s in use (a recent topic on
the Unicode discussion list). Cf. the following German and Swiss
newspaper mastheads:

<http://www.harlinger-online.de/>   _A_nzeiger für Harlingerland
<http://www.giessener-anzeiger.de/> Gießener _A_nzeiger
<http://www.faz.net/>               Frankfurter _A_llgemeine Zeitung

All of these are LATIN CAPITAL LETTER A, as are the various Roman-style
"A"'s on the same pages -- although they *easily* exceed the level
of structural difference that usually characterizes Chinese versus
Japanese stylistic differences for the same characters.

From the *encoding* context (all of these would be represented by
ISO/IEC 8859-1, Latin-1), one cannot tell which glyph is correct
in which context. Yet it would be a clear and obvious mistake to
use the Fraktur style "A" from the Harlingerland newspaper to
present the name of the Frankfurter Allgemeine Zeitung -- this thing
is almost a trademarked font style, like the font that Woody Allen
uses for all his movies. ;-) The fact that Unicode alone (or
8859-1 alone) as a character encoding does not provide sufficient
context for international use of "A" while preserving all the font
style differences is wholly beside the point. In fact, the "A" gets
used exactly as it should, to convey the *character* identity, and
then font distinctions are used to carry all the rest of the
stylistic differences for presentation to end users.
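
Reduced to a one-line Python 3 check (again just my own illustration):
the Fraktur masthead "A" and the Roman body-text "A" come down to the
same byte and the same character.

  assert "Anzeiger".encode("latin-1")[0] == 0x41 == ord("A")   # one character, many fonts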

--Ken

P.S. Sorry for spending the bandwidth somewhat off-topic. I'll go back
to lurking now. ;-)

P.P.S. My apologies in advance to Ohta-san for using some 8859-1
characters in this text, which he may reject with his boiler-plate
comment: [Charset XXX unsupported, skipping...] If his Japanese
email client can't handle 8859-1 characters for a discussion of
German, even though my email client is 8859-1 based and can,
I'd consider that an argument *for* switching to a Unicode-enabled
email client. :-)