
Re: [idn] Re: IDNA: is the specification proper, adequate, and complete?



> The bad news is that, if this character [form] is observed in a
> written string, or a non-Unicode coding system that does not
> distinguish between Eszett and "small letter [Greek] beta", one
> really needs to have language context in order to determine
> which Unicode character to map it to, especially because, given
> stringprep, the consequences of getting it wrong are likely to
> be very significant.

A few items here.

1. While there are many Unicode characters with very similar shapes,
beta and eszett -- in normal fonts -- are really no more alike than y
and gamma (or, for that matter, capital I, lowercase L, and 1!!). Beta
typically has a descender, while eszett does not. See
http://www.macchiato.com/unicode/beta-eszed.htm for a GIF image from
three fonts.
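
In case anyone wants to double-check: the two are separate code
points with separate names, and normalization does not merge them.
A quick check in Python (my illustration, not anything from the
specs):

    import unicodedata

    # U+00DF and U+03B2 are distinct characters with distinct names,
    # and NFKC normalization leaves both of them alone.
    for ch in ("\u00DF", "\u03B2"):
        print("U+%04X %s NFKC->%r" % (ord(ch), unicodedata.name(ch),
                                      unicodedata.normalize("NFKC", ch)))

    # U+00DF LATIN SMALL LETTER SHARP S NFKC->'ß'
    # U+03B2 GREEK SMALL LETTER BETA NFKC->'β'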

2. There are no "non-Unicode coding systems" that unify beta and
eszett; the language issue is irrelevant.

3. As I pointed out some time ago on this list, it is *not* rocket
science to provide a user interface that makes it very clear to people
when a domain name mixes scripts, and it is just as simple to extend
such an interface to other confusables. For a demo, see
http://www.macchiato.com/utc/show_script.html.
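
A minimal sketch of that kind of check, for the curious. The script
ranges below are deliberately simplified; a real implementation
would use the Unicode Scripts data file:

    # Sketch only: report which scripts occur in a label.  The
    # ranges cover just a few scripts, for illustration.
    SCRIPT_RANGES = [
        (0x0041, 0x007A, "Latin"),
        (0x0370, 0x03FF, "Greek"),
        (0x0400, 0x04FF, "Cyrillic"),
    ]

    def scripts_in(label):
        found = set()
        for ch in label:
            for lo, hi, name in SCRIPT_RANGES:
                if lo <= ord(ch) <= hi:
                    found.add(name)
        return found

    label = "caf\u03B2"    # "caf" plus GREEK SMALL LETTER BETA
    if len(scripts_in(label)) > 1:
        print("mixed scripts:", sorted(scripts_in(label)))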

I have recommended and continue to recommend that the IDNA documents
contain some wording on this, something like:

"To help prevent confusion between characters that are visually
similar, it is recommended that implementations provide visual
indications where a domain name contains multiple scripts. Such
mechanisms can also be used to show when a name contains a mixture of
simplified and traditional characters, or to distinguish zero and one
from O and l."

Mark
__________
http://www.macchiato.com
◄  “Eppur si muove” ►

----- Original Message -----
From: "John C Klensin" <klensin@jck.com>
To: "Dan Oscarsson" <Dan.Oscarsson@trab.se>; <idn@ops.ietf.org>
Sent: Tuesday, June 18, 2002 07:12
Subject: Re: [idn] Re: IDNA: is the specification proper, adequate,
and complete?


> --On Tuesday, 18 June, 2002 11:41 +0200 Dan Oscarsson
> <Dan.Oscarsson@trab.se> wrote:
>
> > While there is no doubt about the above, I am not sure that
> > the nameprep specification should map 00DF (small letter sharp
> > s) to "ss". I am not sure how Germans handle this character.
> > Do they always replace double s with it? Or only in some
> > special words? If they do not generally do this, the mapping
> > should not be done. It is somewhat like the fact that the
> > Greek version of Latin A is not mapped to the Roman version,
> > even though they share the same origin and look alike.
>
> This was discussed at length early in the history of the WG.  By
> convention, it is always possible to replace Eszett with "ss",
> and the upper-case form of Eszett is always "SS", but there are
> many words in which "ss" appears for which a substitution back
> to Eszett is not appropriate.  In other words, this is a
> one-way, non-reversible (without word context) mapping.  The
> good news is that stringprep gets it right (or at least
> consistent with other WG decisions) given coding as 0x00DF,
> which clearly identifies the character as Eszett.
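>
> For instance, Python's casefold(), which applies the same Unicode
> case folding that the stringprep tables are derived from, shows
> the one-way behavior of this mapping (a sketch, not the normative
> tables):
>
>     s = "stra\u00DFe"            # "Strasse" spelled with Eszett
>     print(s.casefold())          # strasse -- folded to "ss"
>     print("strasse".casefold())  # strasse -- nothing folds back
>     print("\u00DF".upper())      # SS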
>
> The bad news is that, if this character [form] is observed in a
> written string, or a non-Unicode coding system that does not
> distinguish between Eszett and "small letter [Greek] beta", one
> really needs to have language context in order to determine
> which Unicode character to map it to, especially because, given
> stringprep, the consequences of getting it wrong are likely to
> be very significant.
>
> But this is not new news, or even a new example.  We may see it
> differently as our sensitivities to the issues evolve, but the
> bottom line is that Unicode is not especially well adapted to
> coding of strings that appear without language, or even word,
> contexts in non-Unicode form.  Whether that form is a
> pre-existing coding system, or a sign on the side of a bus,
> there are likely to be examples of problematic characters.
> Unfortunately, there are no standardized alternatives for a UCS
> with even near-global applicability.  And it appears obvious to
> me that, while a hypothetical Unicode alternative could make
> different choices, it wouldn't eliminate these problems, but
> rather just create a different set of scary examples.
>
> I think there are only two ways out of this, and neither
> involves either changes in stringprep or more examples of this
> type.  For the latter, I think almost everyone with a strong
> desire to understand the problem has done so and that the odds
> of convincing others are fairly low.
>
> The alternatives are:
>
> (1) To define the problem this working group is trying to solve
> in a way that causes these problems to be non-issues.
> Personally, every time I try to do that, I end up with what feel
> to me to be silly states, but it is clear that I'm in the
> minority of those speaking up in the WG.  For example, one could
> say, and I think we essentially have, that the WG is solving the
> problem of getting things into and out of the DNS given that the
> Unicode coding form is accurately known.  This implies that any
> applications which can't succeed in making the translations are
> going to be in very bad trouble and that we offer them no help
> or hope -- it isn't our problem.   Personally, I think that, if
> the WG's position and recommendations are based on that model,
> we should be obligated to write it down and make it explicit in
> our documents before they go onto the standards track: we owe
> that much to those who think we are solving any of a number of
> more general internationalization problems.
>
> (2) To just give it up.  The DNS effectively imposes "no
> language information" and "no script information" restrictions
> on us.  Its octet-based comparison rules effectively prevent us
> from imposing conventions that would permit guessing anything
> from context, nor can we prohibit a string that contains 0x00DF
> in the middle of a string of Greek, or for that matter Arabic,
> characters.  In the Arabic case, it would at least stand out as
> strange; in the Greek one, humans would almost certainly mistake
> it, in displayed form, for lower-case beta.  Given the
> restrictions, these presentation-relationship problems have no
> in-DNS solution.
>
> My own personal position on this is presumably well-known at
> this point:  we have tried very hard to solve a "DNS
> internationalization" problem and have ended up with a number of
> extremely convincing demonstrations (this example not least
> among them) that it is overconstrained and can't be solved.  If
> one accepts that, then there is a strong case for saying "sorry,
> whatever it is people wish for, and can make work in restricted
> contexts, DNS labels for common, existing, RRs and applications
> are limited to ASCII-based protocol elements and we had best go
> solve the real problem and requirement in some less-constrained
> environment".
>
>
> While others clearly disagree (and are probably the majority), I
> also don't believe it is appropriate for IETF to adopt a
> protocol without clear evidence that it can be implemented and
> used by at least a few of the applications that are anticipated
> for it -- applications that need to deal with presentation
> forms, operating systems, and other aspects of the real world.
> But, again, if we are going to do it, or if we see some
> restricted applications for which this _does_ provide a useful
> part of a solution, I believe that we are obligated to explain
> exactly what problem we are solving, so as not to inadvertently
> surprise implementors and users with scenarios and problems that
> we understand.  If we didn't know about these problems, it might
> be different, but we certainly do.
>
> But I do think those are the issues that the WG (and really, at
> this stage, the IESG) should be discussing.  And, if there are
> new issues, I, at least, would like to understand them.  But
> more examples of transcription or transcoding difficulties
> won't, I think, help: we already have enough examples to kill a
> dozen protocols if the WGs considered them relevant.  More
> examples, or repetition of older ones, will not persuade anyone
> who already considers the ones documented so far irrelevant.
>
> Just my opinion.
>
>      john
>