
[idn] Re: IDNA: is the specification proper, adequate, and complete? (was: Re: I-D ACTION:draft-ietf-idn-idna-08.txt)



Paul Hoffman / IMC <phoffman@imc.org> writes:

>>IDNA resolves some ambiguities in identifiers by Unicode
>>normalization, and introduces further ambiguities by not handling
>>legacy charset transcoding issues at all.
>
> Simon, both of those statements are wrong, and Vint is right. Unicode
> normalization doesn't fix ambiguous references, it canonicalizes
> references: there is a huge difference between those two. "Letter A
> followed by combining umlaut" is not ambiguous: it means that the
> display should show an a with an umlaut over it.

If I see the (Swedish) word "å" displayed on my screen and cut'n'paste
it into a browser, an IDNA resolver will normalize it to U+00C5 before
a server is queried for that string, regardless of whether the
original string was U+00C5 or U+212B.  Isn't this "resolving
ambiguity"?  Isn't there an ambiguity between U+00C5 and U+212B?  And
if there isn't an ambiguity between U+00C5 and U+212B, why does IDNA
treat them the same?  Perhaps I fail to communicate; not being a
native speaker, perhaps I'm interpreting the word "ambiguous"
incorrectly, although my dictionary doesn't help me find any
alternative interpretation.
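
To make the point concrete, here is a minimal Python sketch (my own
illustration, not anything from the spec; nameprep also does case
folding and other mappings) showing that the KC normalization it uses
treats the two code points as the same thing:

  import unicodedata

  angstrom_sign = "\u212B"   # ANGSTROM SIGN
  a_with_ring   = "\u00C5"   # LATIN CAPITAL LETTER A WITH RING ABOVE

  # NFKC maps both to the same code point, so a resolver ends up
  # querying the same name no matter which one the user pasted.
  print(unicodedata.normalize("NFKC", angstrom_sign) ==
        unicodedata.normalize("NFKC", a_with_ring))   # True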

> There are charset transcoders today that transcode differently from
> each other. That's not an ambiguity, that's a mistake. No one can
> create protocols that fix every previous mistake.

You can fix this one mistake.  Building a critical infrastructure on
past mistakes doesn't seem like a good idea to me.

>>Now, one can argue that Unicode normalization is only used because
>>Unicode happens to have different ways of representing the same, or
>>non-visual, characters, but nevertheless this adds an ambiguity
>>resolving mechanism to software.  One that will have to be modified
>>over time, as well, since consensus on how to resolve ambiguities will
>>change over time.  I have trouble visualizing how this can be
>>implemented and work well for 2, 5, 10 years and more, when Unicode
>>and other charsets are moving targets.
>
> So your solution is that nothing can ever be internationalized?

That's not a solution, and it's not what I'm proposing; I don't
understand how that could be read into what I wrote.  But I'll try to
be specific about how to solve the problems with IDNA right now,
giving internationalized domain names that would be secure, could be
implemented, and would continue to work for years ahead:

First, specify clearly that applications MUST NOT use any
normalization table other than the one defined in the IDNA spec suite
(currently following Unicode 3.1, to be updated to Unicode 3.2 if I
understand things correctly), and in particular that normalization
tables supplied by operating systems must never be used unless the
application author can assert that they will never change throughout
the lifetime of the application (which will probably only be true if
the application author is also the operating system author).
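
As a rough sketch of what I mean, with a hypothetical
PINNED_UNICODE_VERSION constant, and using Python's bundled tables
merely as an example of a runtime-supplied table:

  import unicodedata

  # Hypothetical guard: refuse to rely on whatever normalization
  # tables the runtime happens to ship unless they match the Unicode
  # version the IDNA spec suite pins (3.1 in this example).
  PINNED_UNICODE_VERSION = "3.1.0"

  if unicodedata.unidata_version != PINNED_UNICODE_VERSION:
      raise RuntimeError(
          "runtime ships Unicode %s tables, spec requires %s; "
          "fall back to the application's own pinned tables"
          % (unicodedata.unidata_version, PINNED_UNICODE_VERSION))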

Second, define how to transcode legacy charsets into Unicode, and
specify that only these transcoding tables are to be used.
Transcoding mapping tables can be defined in RFCs, much like MIME CTEs
or similar.  The initial IDNA spec suite could define transcoding
tables for commonly used charsets: ISO-8859-X, ISO-2022-X, KOI8-X,
KS-C-5601-X, etc.
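
A sketch of the idea, with a made-up table name; the point is only
that the mapping is frozen in an RFC rather than taken from whatever
converter the platform happens to ship:

  # Hypothetical spec-defined transcoding table: ISO-8859-1 byte ->
  # Unicode code point, frozen in an RFC.  Applications use this
  # table, never the operating system's own (possibly divergent)
  # converter.
  ISO_8859_1_TO_UNICODE = {byte: byte for byte in range(0x00, 0x100)}
  # (ISO-8859-1 happens to map 1:1 onto U+0000..U+00FF; other
  #  charsets would need real tables published alongside the IDNA
  #  documents.)

  def transcode_iso_8859_1(octets):
      """Transcode using only the spec-defined table."""
      return "".join(chr(ISO_8859_1_TO_UNICODE[b]) for b in octets)

  print(transcode_iso_8859_1(b"\xe5"))   # "å" (U+00E5)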

The main argument against these proposals is that they require a lot
of work to implement, but if the alternative is poor security, I'd
rather have people do a lot of work.  A lesser argument against them
is that they don't pick up new updates to Unicode, but that is by
design.