
Re: Inputting mixed SC/TC (Re: [idn] A question...)



Adam M. Costello writes:
> The reason IDNA does case-folding is to be consistent with the existing
> standard for domain names, which says they are case-insensitive.

What the existing standard actually says is ``domain name comparisons
for all present domain functions are done in a case-insensitive manner,
assuming an ASCII character set, and a high order zero bit.''

Similarly, the Internet mail standards specifically require that bytes
in message headers---including domain names---be interpreted as ASCII
characters.

Complete consistency with the existing standards would mean continuing
to use only bytes 0-127, continuing to interpret those bytes as ASCII,
and continuing to compare names as case-insensitive ASCII names.
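
To be concrete, that existing comparison rule is tiny. Here is a
minimal sketch in Python (the function names are mine, purely for
illustration):

   def ascii_fold(name):
       # Fold only the ASCII letters A-Z to a-z; every other byte,
       # including anything above 127, passes through untouched.
       return bytes(b + 32 if 0x41 <= b <= 0x5A else b
                    for b in name)

   def names_equal(a, b):
       # Case-insensitive comparison as the existing standard
       # defines it: bytewise equality after ASCII folding.
       return ascii_fold(a) == ascii_fold(b)

Under this rule names_equal(b"Example.COM", b"example.com") holds,
while two UTF-8 strings differing only in the case of a non-ASCII
letter do not, because the fold never touches bytes above 127.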

But we don't _want_ to follow those rules. We want to see glyphs that
simply aren't available in the ASCII character set.

Of course, we have to maintain INTEROPERABILITY with all strings used
today, so we'll have to continue accepting A-Z and a-z as equivalent.
But there are many possible equivalence rules for non-ASCII strings.
Here are several examples---certainly not a complete list:

   (1) Exactly what software uses now: no equivalences outside ASCII.

   (2) Equivalence of characters that have duplicate glyphs but that
       were kept separate by Unicode for one of the reasons described in
       http://www.unicode.org/unicode/standard/where.

   (3) American-biased equivalences according to Mark Davis's UTR 21,
       which is _not_ part of the Unicode standard.

   (4) German equivalences: for example, o-umlaut equivalent to oe, and
       the German sharp s (ß) equivalent to the two-byte Latin sequence
       SS, which in turn is equivalent to the two-byte Latin sequence
       ss (see the sketch after this list).

   (5) Hebrew equivalences: for example, aleph-bar equivalent to aleph.

   (6) Various Chinese equivalences for the benefit of Chinese users.

   (7) Some combination of the above.
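
To make the choice among these rules concrete, here is a rough sketch
of an equivalence function for #4 (Python; the folding table is a
deliberately incomplete illustration, not a statement of German
orthography):

   GERMAN_FOLD = {
       "Ä": "ae", "ä": "ae",
       "Ö": "oe", "ö": "oe",
       "Ü": "ue", "ü": "ue",
       "ß": "ss",   # sharp s; SS already folds to ss below
   }

   def fold_german(label):
       out = []
       for ch in label:
           ch = GERMAN_FOLD.get(ch, ch)
           for c in ch:
               # A-Z must still fold to a-z for interoperability
               # with existing ASCII names.
               out.append(chr(ord(c) + 32) if "A" <= c <= "Z" else c)
       return "".join(out)

Under this function fold_german("Müller") and fold_german("MUELLER")
compare equal; under #1 or #3 they do not. Nothing in the existing
standards selects one of these functions over another.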

All of these are INTEROPERABLE with the existing use of ASCII. None of
them are CONSISTENT with the existing standards. One of them, #1, has
the advantage of being by far the easiest to implement---but provides
the most opportunities for confusion and fraud.

What exactly is the rational line between, for example, #3 and #4? For
ASCII characters they both boil down to A-Z matching a-z. Why is #3 a
better extension of the current situation than #4, or #3+#4?

James Seng states that #6 is pointless because ``domain names are
identifier ... should enter into the computer exactly as they seen it or
reference it.'' Under exactly the same principle, #3 and #4 and #5 are
all pointless, so IDNA has no excuse for the costs of #3.

Another approach, allowing the software simplicity of #1 but eliminating
user confusion, is to allow _selected_ non-ASCII characters. We don't
have to map all characters to the selected set; we simply have to make
sure that the selected characters won't be confused by the users. This
neatly dodges the difficulty of defining a broad equivalence rule.
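
Here is a sketch of that selected-characters approach, assuming each
zone publishes its table of permitted characters (the particular table
below is hypothetical):

   # Hypothetical per-zone table: lowercase ASCII letters, digits,
   # hyphen, plus a few selected non-ASCII characters judged not to
   # be confusable with anything else in the table.
   PERMITTED = set("abcdefghijklmnopqrstuvwxyz0123456789-") | {"é", "ü"}

   def ascii_lower(label):
       # The only equivalence is the existing A-Z/a-z one (#1).
       return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c
                      for c in label)

   def label_acceptable(label):
       # Non-ASCII characters are never mapped to anything; a
       # character outside the permitted set is simply rejected
       # at registration time.
       return all(c in PERMITTED for c in ascii_lower(label))

Comparison then stays exactly as in #1; all the work happens as a
one-time rejection at registration.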

The decisions here have to be based on rational assessments of costs and
benefits. Costello's notion of ``consistency'' is obviously not helpful:
it leads to such huge costs for Chinese users that it has already drawn
objections from _three hundred_ people.

---D. J. Bernstein, Associate Professor, Department of Mathematics,
Statistics, and Computer Science, University of Illinois at Chicago