
[idn] Re: IDNA: is the specification proper, adequate, and complete? (was: Re: I-D ACTION:draft-ietf-idn-idna-08.txt)



Paul Hoffman / IMC <phoffman@imc.org> writes:

> At 6:32 PM +0200 6/17/02, Simon Josefsson wrote:
>>If I see the (Swedish) word "å" displayed on my screen, cut'n'paste it
>>into a browser, an IDNA resolver will normalize this into U+00C5
>>before a server is queried for that string, regardless of whether the
>>original string was U+00C5 or U+212B.  Isn't this "resolving
>>ambiguity"?
>
> No, it is canonicalizing. In the bits on the wire, there is nothing
> ambiguous about the combined version or the uncombined version: they
> are very clearly different sets of characters, and the representation
> of those characters in every encoding of Unicode is also non-ambiguous.

Right.  But isn't the _reason_ for canonicalizing that the characters
are visually ambiguous?  The canonicalization step isn't there for the
fun of it, surely.  If the reason for canonicalization isn't to
resolve visual _ambiguity_, what is it there for?

>>   Isn't there an ambiguity between U+00C5 and U+212B?
>
> Only visually, not in the protocol.

The IDNA "protocol" handles how the visual ambiguity is resolved, with
the technique being canonicalization / normalization / decomposing.

>>   If
>>there isn't an ambiguity between U+00C5 and U+212B, why does IDNA
>>treat them the same?
>
> It doesn't. It canonicalizes one into the other. That is far from
> "treating them the same", yes?

Sorry, I don't see the difference.  After IDNA processing, a string
containing U+212B is identical to one containing U+00C5.  There is no
way at the server to tell whether the user entered U+00C5 or U+212B.
There is no way to register two domains that differ only in the
Unicode code point chosen for "å".  This is what I meant by IDNA
treating them the same.
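
A minimal sketch of that collapse, using only Python's standard
library.  It assumes the `unicodedata.ucd_3_2_0` tables and the
built-in `idna` codec (which follows the later published RFC 3490
rules, not necessarily draft -08) as stand-ins for an IDNA resolver;
the helper names `nfkc` and `to_ace` are made up for this illustration.

```python
from unicodedata import ucd_3_2_0  # Unicode 3.2 character database

ANGSTROM_SIGN = "\u212b"  # ANGSTROM SIGN
A_WITH_RING = "\u00c5"    # LATIN CAPITAL LETTER A WITH RING ABOVE


def nfkc(text):
    """Normalize with the pinned Unicode 3.2 tables (hypothetical helper)."""
    return ucd_3_2_0.normalize("NFKC", text)


def to_ace(label):
    """Encode one label to its ASCII form via the stdlib codec (hypothetical helper)."""
    return label.encode("idna")


# The two code points are distinct on input...
print(ANGSTROM_SIGN == A_WITH_RING)
# ...but normalization maps the Angstrom sign onto A-with-ring,
print(nfkc(ANGSTROM_SIGN) == A_WITH_RING)
# and both inputs therefore yield the same label on the wire.
print(to_ace(ANGSTROM_SIGN) == to_ace(A_WITH_RING))
```

If the last two comparisons print True, the server receives the same
bytes either way and cannot recover which code point the user typed.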

Resolving ambiguity in this way can introduce new ambiguity.  Consider
a user who intentionally enters U+212B because Unicode attaches a
different meaning to it than to U+00C5; IDNA still resolves both into
one code point.  Unless the user knows the Unicode 3.2 decomposition
table, she cannot tell whether those two code points are treated
differently or not.

>>   Perhaps I fail to communicate, not being a
>>native speaker perhaps I'm interpreting the word "ambiguous"
>>incorrectly, although my dictionary doesn't seem to help me find any
>>alternative interpretation.
>
> My dictionary has these definitions:
> - having two or more possible meanings
> - doubtful, uncertain
> U+00C5 and U+212B do not have the same meaning, and there is nothing
> doubtful or uncertain about either of them.

If you look at the character "å" it can have at least two (Unicode)
meanings, either U+00C5 or U+212B.  This ambiguity is resolved in IDNA
by normalization.  To the user, whether "å" denotes U+00C5 or U+212B
is "ambiguous".

>>  > There are charset transcoders today that transcode differently from
>>>  each other. That's not an ambiguity, that's a mistake. No one can
>>>  create protocols that fix every previous mistake.
>>
>>You can fix the one mistake.
>
> Which one mistake is that? There are probably dozens of transcoders
> with errors, and worse yet, there are probably dozens of transcoder
> implementors that, in the face of some IETF or Unicode standard that
> tells them how to transcode, would say "screw you, you don't
> understand our language" (and they would possibly be correct).

The mistake is that the transcoding tables are not specified
anywhere for IDNA purposes.

If they tell us that, and are correct in telling us that, why not let
them define the transcoding table?  It sounds like they would do a
better job.  If different people say that, you need to invite all of
them to resolve their differences, or state that IDN will not work
with charset X because people cannot agree.  IMHO the end result will
be better than (as is proposed now) leaving those people alone to
implement different transcoding tables.
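
To make the transcoding concern concrete, here is a small sketch of two
widely deployed tables disagreeing about the same legacy bytes.  It
assumes Python's bundled `shift_jis` and `cp932` codecs as examples of
"different transcoders"; the `describe` helper is hypothetical, and
nothing here comes from the IDNA drafts themselves.

```python
# One two-byte character as it would appear in Shift_JIS-encoded input.
LEGACY_BYTES = b"\x81\x60"

# Two transcoders that both claim to handle this charset family.
via_shift_jis = LEGACY_BYTES.decode("shift_jis")  # JIS-style table
via_cp932 = LEGACY_BYTES.decode("cp932")          # Microsoft-style table


def describe(char):
    """Format a single character as U+XXXX (hypothetical helper)."""
    return "U+%04X" % ord(char)


print(describe(via_shift_jis))  # U+301C (WAVE DASH)
print(describe(via_cp932))      # U+FF5E (FULLWIDTH TILDE)

# Same bytes typed by the user, different Unicode input to IDNA.
print(via_shift_jis == via_cp932)  # False
```

Which table a given application uses is exactly the choice that is left
unspecified today, so two clients can hand IDNA different code points
for what the user sees as one character.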

>>  > So your solution is that nothing can ever be internationalized?
>>
>>That's not a solution, and that's not what I'm proposing, I don't
>>understand how that could ever be read into what I wrote,
>
> Because you said "I have trouble visualizing how this can be
> implemented and work well for 2, 5, 10 years and more, when Unicode
> and other charsets are moving targets." I agree with you that Unicode
> and other charsets are moving targets.

"This" was referring to the current IDNA proposal.

I'm convinced it is possible to implement internationalized domain
names in a way that works.  (In fact, I see IDNs on ads in buses and
trains every other week or so; to doubt that it is possible to
implement would be foolish.)

>>  but I'll try
>>to be specific on how to solve the problems with IDNA right now,
>>giving internationalized domain names that would be secure and could
>>be implemented and continue to work years ahead:
>>
>>First, specify clearly that applications MUST NOT use any other
>>normalization table than the one defined in the IDNA spec suite
>>(following Unicode 3.1 currently, being updated to Unicode 3.2 if I
>>understand things correctly) and that in particular normalization
>>tables supplied by operating systems should never be used unless the
>>application author can assert that they will never change throughout
>>the lifetime of the application (which probably only will be true if
>>the application author is the operating system author).
>
> We already say the first part (you must use the Unicode 3.1 -- soon to
> be Unicode 3.2 -- table). We don't say the second part because it
> flows from the first part.

OK, perhaps it is sufficient.  We'll see how people implement it.

>>Secondly, define how to transcode legacy charsets into Unicode, and
>>specify that only this transcoding table is to be used.  Transcoding
>>mapping tables can be defined in RFCs, much like MIME CTE's or
>>similar.  The initial IDNA spec suite could define transcoding tables
>>for commonly used charsets; ISO-8859-X, ISO-2022-X, KOI8-X,
>>KS-C-5601-X etc.
>
> Yes, we could do that, but the IETF lacks both the linguistic and
> political expertise to do it. The fact that even the experts such as
> ISO and the Unicode Consortium have not chosen to do this should be a
> very broad hint to you about why the IETF shouldn't. But if you really
> think this is needed (I still don't), you absolutely should ask the
> appropriate bodies (Unicode or ISO) to do it. If they do it, I'd bet
> that the IETF would strongly consider pointing to those standards.
>
>>The main argument against these proposals is that they require lots
>>of work to implement, but if the alternative is poor security, I'd
>>rather have people do lots of work.  A lesser argument against it is
>>that they don't adopt new updates to Unicode, but that is by design.
>
> We disagree about what the main argument is. Creating transcoding
> tables is easy; in fact, it has already been done. See
> <http://www.unicode.org/Public/MAPPINGS/> for some non-official
> mappings.
>
> My main argument against the IETF doing this is that being sure the
> tables are "right" is nearly impossible because it involves getting
> consensus among the users of the scripts and the experts.

Perhaps we do agree; it was the consensus building I was referring to
with "require lots of work".

Without solving the transcoding issues, if banks have the nerve to
use non-ASCII IDNs for critical purposes, people will exploit these
characteristics of IDNA.  With the current proposal, I think banks
will choose not to use IDNs because of the security concerns, which
I consider a partial failure of IDN.

I don't see much useful coming out of this discussion, and I'm not
offering to talk with ISO or Unicode, so I'll stop whining for now.