
Re: [idn] Re: Document Status?



Simon Josefsson <jas at extundo dot com> wrote:

> + The entire world doesn't use Unicode, which is where IDNA starts.
>   There are examples of characters in European charsets that may fail
>   to translate into Unicode properly (e.g., Greek beta and German ss
>   in CP437).  I suspect this might be more common in non-Western
>   charsets.  If someone has looked into this area more closely, I'd
>   appreciate a pointer.  The IDN specifications surely don't deal
>   with it.  Detailed scenario that fails: www.ßeta.com browsed from a
>   CP437 platform.

This is not a case of a character "failing to translate to Unicode
properly."  That would be true if we were talking about a Glagolitic
letter or Egyptian hieroglyph.  This is a case of ambiguity between
legacy character mappings.

The argument here is that Unicode should not be used for intersystem
communication because characters in legacy character sets have a history
of being overloaded.  Arguments of this type are gradually disappearing
as people begin to realize that the problem exists with or without
Unicode.

The alternative would be to stick with legacy character sets and tag all
IDNs with the character set.  This doesn't solve the overloading
problem, since non-CP437 applications still must decide whether to
convert CP437 0xE1 to German ss or Greek beta.  And you would
additionally run into the ISO 2022 problem, in which applications are
expected to implement the entire repertoire of character conversion
tables, but most don't because of the overhead.  Are you sure that every
Unix, Linux, Windows, and Mac client would provide a mapping table for
CP437?

Unicode publishes informative mapping tables to various legacy character
sets, including CP437.  They map CP437 0xE1 to U+00DF LATIN SMALL LETTER
SHARP S.  Although these mappings are only informative, an application
could reasonably be expected to implement them.  (In any case, it must
be said that CP437 is a lousy choice for interchanging Greek text -- 737
or 869 would be better -- so the mapping to German ss is hardly
counterintuitive.)
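
For illustration only (not part of the original argument), Python's
bundled cp437 codec follows that informative mapping table, so the way
the 0xE1 ambiguity is resolved is easy to see:

    # Byte 0xE1 decodes to U+00DF (German sharp s) under the informative
    # Unicode mapping table for CP437, not to Greek beta.
    raw = bytes([0xE1])
    print(raw.decode("cp437"))        # ß

    print("ß".encode("cp437"))        # b'\xe1' -- round-trips cleanly
    try:
        "β".encode("cp437")           # U+03B2 GREEK SMALL LETTER BETA
    except UnicodeEncodeError:
        print("Greek small beta has no CP437 byte under this mapping")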

> + The choice of Unicode normalization KC has been questioned.  Again,
>   since I'm familiar with European charsets, I have the simple example
>   of normalization of ß into ss.  There are supposedly distinct words
>   where this normalization process removes the possibility of
>   distinguishing between the words.  Non-Western charsets probably
>   have more cases like this.  Detailed scenario that fails:
>   www.masse.de (translation: mass, majority) and www.maße.de
>   (translation: metrics, gauges) are indistinguishable.

If NFKC were not used, the security mavens would be out in droves,
complaining about the numerous lookalike characters that could be used
to fool unsuspecting users.

Most languages have "minimal pairs" of words that differ only in case.
Few people complain that the existing DNS cannot distinguish between
words such as "mark," what you do with a pen on paper, and "Mark," a
proper name.  The same is true for Masse and Maße.  The DNS is not
supposed to handle "names" in languages on this level.  It is
unfortunate that the term "domain name" has come to convey the meaning
of language-aware names, since domain names are merely identifiers,
hopefully more mnemonic than strings of integers separated by dots.
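
For the record, the behavior Simon describes is easy to reproduce.  A
minimal sketch using Python's built-in "idna" codec, which implements
the nameprep profile (mapping plus NFKC) required by IDNA 2003, shows
that both spellings end up as the same ASCII name:

    # Under nameprep, ß is folded to "ss", so the two domain names
    # become identical before they ever reach the DNS.
    print("www.masse.de".encode("idna"))   # b'www.masse.de'
    print("www.maße.de".encode("idna"))    # b'www.masse.de'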

> + Any modifications to the Unicode code charts or normalization
>   tables destroy the stability of IDN.  This is handled by locking
>   IDN to Unicode 3.2 (I believe).  I don't have a specific scenario
>   that fails, but this will add considerable code complexity in
>   software, since it needs to implement Unicode normalization, Unicode
>   tables and possibly also Unicode bidi algorithms internally instead
>   of relying on a Unicode implementation in the operating system.  It
>   may seem like this problem can't be solved in a better way, but I
>   have a feeling a better design could fix this.

Adding more characters to Unicode does not destroy the stability of
anything, because the normalization tables are supposed to be frozen and
new characters will not have compatibility decompositions.  I know there
have been changes to the normalization of two characters since Unicode
3.0 -- not exactly an overwhelming number -- but part of the rationale
was that applications such as IDN that require such stability were not
yet available.  It is extremely unlikely that we will see any changes to
normalization after IDN is rolled out.  We will, of course, see
thousands of new characters, but as I said, that causes NO stability
problems.

> + Unicode normalization and bidi rules interact problematically.
>   Consider the string U+05D0 U+0966; it is a forbidden bidi sequence.
>   U+2135 U+0966 is not a forbidden sequence.  Yet, U+2135 is
>   normalized into U+05D0.  Detailed scenario that fails: STRINGPREP
>   on the string www.U+2135U+05D0.com works from machines that don't
>   perform Unicode normalization (which IS optional).  Yes, OK, this
>   is not an IDN problem but a STRINGPREP problem, but STRINGPREP
>   seems to come out of this group, so it is related.

Stringprep, the generic algorithm, makes normalization optional.
Nameprep, however, is a specific profile of stringprep that *requires*
normalization form KC.

U+05D0 is HEBREW LETTER ALEF.
U+2135 is ALEF SYMBOL.
U+0966 is DEVANAGARI DIGIT ZERO.
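
The compatibility relationship is easy to verify; here is a minimal,
purely illustrative sketch using Python's unicodedata module:

    import unicodedata

    # ALEF SYMBOL carries a compatibility decomposition to HEBREW LETTER
    # ALEF, so NFKC folds the two together while NFC does not.
    assert unicodedata.normalize("NFKC", "\u2135") == "\u05D0"
    assert unicodedata.normalize("NFC",  "\u2135") == "\u2135"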

If you create a domain name consisting of some combination of these, and
it does not work on some systems due to differences in normalization
and/or bidi, you are unlikely to garner much sympathy for your choice of
domain names.  This is a pathological example and you know it.

To paraphrase Churchill, Unicode is the worst character encoding
standard except for all those others that have been tried from time to
time.

-Doug Ewell
 Fullerton, California