
Re: [idn] Re: IDNA: is the specification proper,adequate, and complete?



--On Tuesday, 18 June, 2002 11:41 +0200 Dan Oscarsson
<Dan.Oscarsson@trab.se> wrote:

> While there is no doubt about the above, I am not sure about
> the nameprep specification that 00DF (small letter sharp s)
> should be mapped to "ss". I am not sure how Germans handle
> this character. Do they always replace double s with it? Or
> only on some special words? If they do not generally do this,
> the mapping should not be done. It is somewhat like the fact
> that the Greek version of latin A is not mapped to the Roman
> version of latin A. Even though their origin is the same latin
> A and look alike.

This was discussed at length early in the history of the WG.  By
convention, it is always possible to replace Eszett with "ss",
and the upper-case form of Eszett is always "SS", but there are
many words in which "ss" appears for which a substitution back
to Eszett is not appropriate.  In other words, this is a
one-way mapping that cannot be reversed without word context.
The good news is that stringprep gets it right (or is at least
consistent with other WG decisions) given coding as 0x00DF,
which clearly identifies the character as Eszett.
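
A minimal sketch of the one-way property, using the table B.2
mapping exposed by Python's standard `stringprep` module.  The
helper name `map_b2` is hypothetical, and applying the table one
character at a time only approximates the mapping step of
nameprep; it is not a full nameprep implementation.

```python
import stringprep


def map_b2(label: str) -> str:
    """Apply the RFC 3454 table B.2 mapping to each character.

    Hypothetical helper: this covers only the case-folding/NFKC
    mapping step, not prohibition or bidi checks.
    """
    return "".join(stringprep.map_table_b2(ch) for ch in label)


# Two distinct German words that differ only in Eszett versus "ss".
masse = "Masse"       # "mass"
masze = "Ma\u00dfe"   # "Maße", "measures"

print(map_b2(masse))
print(map_b2(masze))

# Both words collapse to the same mapped string, so the original
# spelling cannot be recovered from the mapped form alone.
print(map_b2(masse) == map_b2(masze))

# The upper-case convention mentioned above is also visible in
# Python's own case mapping.
print("\u00df".upper())
```

Once both spellings have been mapped, nothing in the result says
which one was entered; deciding that would need knowledge of the
word itself.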

The bad news is that, if this character [form] is observed in a
written string, or in a non-Unicode coding system that does not
distinguish between Eszett and "small letter [Greek] beta", one
really needs to have language context in order to determine
which Unicode character to map it to, especially because, given
stringprep, the consequences of getting it wrong are likely to
be very significant.
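
To make the consequence concrete, the following sketch maps the
two candidate code points for the same observed glyph shape.  It
assumes Python's standard `stringprep` and `unicodedata` modules;
`map_b2` is a hypothetical helper that applies only the table B.2
mapping, not the whole of nameprep.

```python
import stringprep
import unicodedata


def map_b2(label: str) -> str:
    """Hypothetical helper: per-character RFC 3454 table B.2 mapping."""
    return "".join(stringprep.map_table_b2(ch) for ch in label)


# Two ways a transcriber might encode the same shape seen on paper.
candidates = {
    "Eszett": "\u00df",  # LATIN SMALL LETTER SHARP S
    "beta": "\u03b2",    # GREEK SMALL LETTER BETA
}

for guess, ch in candidates.items():
    mapped = map_b2(ch)
    # Show the code point, its Unicode name, and what the mapping yields.
    print(guess, f"U+{ord(ch):04X}", unicodedata.name(ch), repr(mapped))

# The two guesses lead to different mapped strings, so a wrong guess
# yields a different label rather than a near miss.
print(map_b2(candidates["Eszett"]) == map_b2(candidates["beta"]))
```

One guess becomes two ASCII letters and the other stays a single
Greek letter, so the choice made at transcription time decides
which name is actually looked up.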

But this is not new news, or even a new example.  We may see it
differently as our sensitivities to the issues evolve, but the
bottom line is that Unicode is not especially well adapted to
coding of strings that appear without language, or even word,
contexts in non-Unicode form.  Whether that form is a
pre-existing coding system, or a sign on the side of a bus,
there are likely to be examples of problematic characters.
Unfortunately, there are no standardized alternatives for a UCS
with even near-global applicability.  And it appears obvious to
me that, while a hypothetical Unicode alternative could make
different choices, it wouldn't eliminate these problems, but
rather just create a different set of scary examples.

I think there are only two ways out of this, and neither
involves either changes in stringprep or more examples of this
type.  For the latter, I think almost everyone with a strong
desire to understand the problem has done so and that the odds
of convincing others are fairly low.

The alternatives are:

(1) To define the problem this working group is trying to solve
in a way that causes these problems to be non-issues.
Personally, every time I try to do that, I end up with what feel
to me to be silly states, but it is clear that I'm in the
minority of those speaking up in the WG.  For example, one could
say, and I think we essentially have, that the WG is solving the
problem of getting things into and out of the DNS given that the
Unicode coding form is accurately known.  This implies that any
applications which can't succeed in making the translations are
going to be in very bad trouble and that we offer them no help
or hope -- it isn't our problem.   Personally, I think that, if
the WG's position and recommendations are based on that model,
we should be obligated to write it down and make it explicit in
our documents before they go onto the standards track: we owe
that much to those who think we are solving any of a number of
more general internationalization problems.

(2) To just give it up.  The DNS effectively imposes "no
language information" and "no script information" restrictions
on us.  Its octet-based comparison rules effectively prevent us
from imposing conventions that would permit guessing anything
from context, nor can we prohibit a string that contains 0x00DF
in the middle of a string of Greek, or for that matter Arabic,
characters.  In the Arabic case, it would at least stand out as
strange; in the Greek one, humans would almost certainly mistake
it, in displayed form, for lower-case beta.  Given the
restrictions, these presentation-relationship problems have no
in-DNS solution.
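
The Greek case in (2) can be sketched as follows.  The script
check here is a crude assumption for illustration only (it reads
the first word of each character's Unicode name); the helper
names `map_b2` and `scripts` are hypothetical, and only the table
B.2 mapping step is applied.

```python
import stringprep
import unicodedata


def map_b2(label: str) -> str:
    """Hypothetical helper: per-character RFC 3454 table B.2 mapping."""
    return "".join(stringprep.map_table_b2(ch) for ch in label)


def scripts(label: str) -> set:
    """Crude script hint: the first word of each character's Unicode name."""
    return {unicodedata.name(ch).split()[0] for ch in label}


# Greek alpha, then 0x00DF, then Greek gamma.
label = "\u03b1\u00df\u03b3"

# The label mixes scripts, although a reader would likely take the
# middle character for a lower-case beta.
print(sorted(scripts(label)))

# The mapping step raises no objection; it rewrites the Eszett in
# place and leaves the Greek letters alone.
print(map_b2(label))
```

The mapping carries no notion of script or language, so any rule
that rejected such a label would have to be applied somewhere
other than in the comparison itself.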

My own personal position on this is presumably well-known at
this point:  we have tried very hard to solve a "DNS
internationalization" problem and have ended up with a number of
extremely convincing demonstrations (this example not least
among them) that it is overconstrained and can't be solved.  If
one accepts that, then there is a strong case for saying "sorry,
whatever it is people wish for, and can make work in restricted
contexts, DNS labels for common, existing, RRs and applications
are limited to ASCII-based protocol elements and we had best go
solve the real problem and requirement in some less-constrained
environment".  


While others clearly disagree (and are probably the majority), I
also don't believe it is appropriate for IETF to adopt a
protocol without clear evidence that it can be implemented and
used by at least a few of the applications that are anticipated
for it -- applications that need to deal with presentation
forms, operating systems, and other aspects of the real world.
But, again, if we are going to do it, or if we see some
restricted applications for which this _does_ provide a useful
part of a solution, I believe that we are obligated to explain
exactly what problem we are solving, so as to not inadvertently
surprise implementors and users with scenarios and problems that
we understand.  If we didn't know about these problems, it might
be different, but we certainly do.

But I do think those are the issues that the WG (and really, at
this stage, the IESG) should be discussing.  And, if there are
new issues, I, at least, would like to understand them.  But
more examples of transcription or transcoding difficulties
won't, I think, help: we already have enough examples to kill a
dozen protocols if the WGs considered them relevant.  More
examples, or repeating of older ones, would not persuade anyone
who is convinced that the ones documented so far are not
relevant that they are suddenly relevant. 

Just my opinion.

     john

