
Re: [idn] WG last call documents



[A response to a remark by Dan Oscarsson appears at the end of this
message.]

"Eric A. Hall" <ehall@ehsco.com> wrote:

>  | Stringprep Profile for Internationalized Host Names
>
> Since this profile is specifically for host names, all of the rules
> that apply to hostnames should be placed here.

I think the intended title of this document is "Stringprep Profile
for Internationalized Domain Names".  The previous IDNA draft was
"Internationalizing Host Names in Applications", but the latest IDNA
draft is "Internationalizing Domain Names in Applications".  I think
that when that change was made, the same change should have been made
to the title of the nameprep draft, and was missed through a simple
oversight.  Can the nameprep authors confirm that?

In my view, (and in the view of the IDNA draft) nameprep is *not*
specifically for host names, but is for all internationalized textual
domain names.  That is how IDNA uses it.

> Consider that there are apps and protocols which may use [nameprep]
> to build internationalized host names but will not use IDNA (either
> because they do not need IDNA conversion for their data-path or they
> use something other than IDNA to encode the names).

Those apps still need the IDNA spec, because the IDNA spec defines
what is a valid internationalized domain label.  Nameprep alone cannot
possibly tell you whether the label is too long; you can know that only
after applying Punycode.
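
To make the length point concrete, here is a minimal Python sketch.  It
assumes the "xn--" ACE prefix (the one that was later standardized; it
is an assumption here, not something the draft fixes) and uses Python's
built-in "punycode" codec.  The helper name ace_length is hypothetical.

```python
# Sketch only: "xn--" is assumed as the ACE prefix; the draft under
# discussion does not fix it.
ACE_PREFIX = b"xn--"
MAX_LABEL_OCTETS = 63  # STD13 limit on a single label


def ace_length(label: str) -> int:
    """Length in octets of the ACE form of an already-nameprepped label."""
    return len(ACE_PREFIX + label.encode("punycode"))


label = "ü" * 60  # 60 code points, so it looks short enough
print(len(label) <= MAX_LABEL_OCTETS)          # length before encoding
print(ace_length(label) <= MAX_LABEL_OCTETS)   # length after encoding
```

The label fits within 63 code points, but its encoded form does not fit
within 63 octets, and nothing in nameprep's output reveals that.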

> draft-ietf-idn-idna-06.txt:
> 
>  | This document defines internationalized domain names (IDNs)
> 
> That definition should be in a separate document

Here is the definition in question:

    An "internationalized domain name" (IDN) is a domain name for which
    the ToASCII operation (see section 4) can be applied to each label
    without failing.

Clearly the IDNA spec is the logical place for that definition.  Maybe
you'd like to change the definition, but that's a separate issue.

> I thought the idea behind adopting "punycode" was to avoid the use of
> the generic "ACE" term?

Nope.  Currently, the relationship between the terms "ACE" and
"Punycode" is as follows:  An ACE label is an ASCII label that is
equivalent to a non-ASCII label.  ToASCII is the operation that converts
labels to equivalent ASCII labels.  Punycode is one of several steps of
ToASCII (other steps include nameprep and checking for various other
restrictions, like length).
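
A simplified sketch of that ordering, in Python, may help.  It is not
the draft's ToASCII: it omits several checks, assumes the "xn--" prefix
that was later standardized, and relies on the nameprep implementation
in Python's standard encodings.idna module.  The function name to_ascii
is hypothetical.

```python
from encodings.idna import nameprep  # stdlib nameprep implementation

ACE_PREFIX = "xn--"  # assumption: the prefix that was later standardized


def to_ascii(label: str) -> str:
    """Simplified ToASCII: nameprep, then Punycode, then the length check."""
    if not label.isascii():
        label = nameprep(label)  # step: nameprep (case folding, NFKC, ...)
    if not label.isascii():
        # step: Punycode, then prepend the ACE prefix
        label = ACE_PREFIX + label.encode("punycode").decode("ascii")
    # step: other restrictions, here only the 63-octet label limit
    if not 1 <= len(label) <= 63:
        raise ValueError("label is empty or longer than 63 octets")
    return label


print(to_ascii("Bücher"))
```

Punycode is only the middle step; the result is an ACE label because the
whole operation produced an ASCII label equivalent to the non-ASCII one.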

The term "ACE" has been used to refer to several subtly different
concepts over the past few years, but no formal definition was ever
given until the recent IDNA drafts.

>  | 1) Whenever a domain name is put into a generic domain name slot,
>  | every label MUST contain only ASCII characters.
>
> "generic domain name slot" should be substituted with "STD13 slot".
> There is no reason whatsoever that application protocols cannot
> exchange IDNs or IHNs in whatever form they wish.

The second sentence is true, but I don't see how the first sentence
follows from it.

IDNA defines "generic domain name slot" as follows:

    An "internationalized domain name slot" is defined in this document
    to be a domain name slot explicitly designated for carrying an
    internationalized domain name as defined in this document.  The
    designation may be static (for example, in the specification of
    the protocol or interface) or dynamic (for example, as a result of
    negotiation in an interactive session).

    A "generic domain name slot" is defined in this document to be any
    domain name slot that is not an internationalized domain name slot.
    Obviously, this includes any domain name slot whose specification
    predates IDNA.

Regardless of what we call it, that is the concept that is needed in
rule 1.  Perhaps you think the term "generic domain name slot" is
counter-intuitive and should be renamed (but not redefined)?

I don't think the term "STD13 slot" is appropriate.  I would expect a
term called "STD13 slot" to be defined in terms of STD13 only, but the
concept we need here is defined in terms of the IDNA spec.

>  | 2) ACE labels SHOULD be hidden from users whenever possible.
>
> Should WHOIS servers perform this conversion on output (email
> addresses and nameservers) since it is likely to be displayed to a
> user?

That rule also says "When requirements 1 and 2 both apply, requirement
1 takes precedence."  If the output of whois is considered to be
machine-readable, then the email address domains and nameservers occupy
generic domain name slots and rule 1 requires the ASCII forms to be
used.  If the output is considered to be plain text for humans, and if
the charset supports non-ASCII, then ToUnicode should be applied to the
domain labels when the output is composed.
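
As a rough Python sketch of that precedence: the function name
render_domain and its two flags are hypothetical, and ToASCII and
ToUnicode are the versions in Python's standard encodings.idna module,
which may differ in detail from the draft.

```python
from encodings.idna import ToASCII, ToUnicode


def render_domain(name: str, for_humans: bool, charset_ok: bool) -> str:
    """Choose the output form of a domain name; rule 1 beats rule 2."""
    labels = name.split(".")
    if for_humans and charset_ok:
        # Rule 2: plain text for humans in a charset that can carry it.
        shown = []
        for label in labels:
            try:
                shown.append(ToUnicode(label))
            except UnicodeError:
                shown.append(label)  # cannot be decoded: show it unchanged
        return ".".join(shown)
    # Rule 1: treat the output as a generic domain name slot, ASCII only.
    return ".".join(ToASCII(label).decode("ascii") for label in labels)


print(render_domain("xn--bcher-kva.example", for_humans=True, charset_ok=True))
print(render_domain("bücher.example", for_humans=False, charset_ok=False))
```

The first call yields the Unicode form for display; the second yields
the ACE form, as rule 1 requires for machine-readable output.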

> Should an email client perform conversion on a message?

On the body, no.  But it should perform conversion on the domain names
in the header.

Hmmm, maybe we should insert "obtained from domain name slots" after
"ACE labels" in rule 2.

Notice that it's a SHOULD and not a MUST.  That is deliberate, because I
don't think any simple rule can describe exactly the circumstances where
you want to perform ToUnicode while excluding all the circumstances
where you wouldn't want to.  Neglecting to apply ToUnicode will at worst
cause users to see garbage, but otherwise won't break anything, so we
leave room for the application programmer to decide that rule 2 is
better left unheeded in special situations.

> CONVERSION MUST ONLY OCCUR WHERE A PROTOCOL OR DATA-FORMAT EXPLICITLY
> DEFINES THE BEHAVIOR.

That would defeat the whole point of IDNA, which is to allow
applications to use internationalized domain names with legacy protocols
and interfaces, without having to revise those protocols and interfaces.

> | 4.1 ToASCII
> | 4.2 ToUnicode
> 
> Add a disclaimer to the effect of ~"these routines MUST NOT be used
> except where the data is known to contain an i18n domain name label".

They are designed to work for all domain labels, and the rules in
section 3 require them to be applied to all domain labels.

The IDNA draft nowhere suggests using them for anything but domain
labels.  But forbidding their use on other things would be out of scope.
IDNA can assert restrictions related to domain names, but the one you
suggest would be a requirement that applies to everything *except*
domain names.

> Trying to pipe an STD13 binary domain name through ToUnicode may
> result in some false-positives.

If a domain label gets altered by ToUnicode, then it *is* an ACE label,
whether by design or by accident.  We've known from the start that
the possibility of accidental ACEs is unavoidable in the IDNA approach,
which takes an existing corner of the namespace and repurposes it.  (But
accidents will be quite rare:  The label must contain only ASCII code
points, and begin with the ACE prefix, and the remainder must be a valid
Punycode encoding, and the decoding of that must remain unchanged by
nameprep).
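
Those four conditions can be sketched in Python as follows.  This is an
illustration under assumptions, not the draft's ToUnicode: the "xn--"
prefix is assumed, the name decode_if_ace is hypothetical, and the
nameprep condition is tested by re-encoding with the standard library's
ToASCII and comparing.

```python
from encodings.idna import ToASCII

ACE_PREFIX = "xn--"  # assumption: the prefix that was later standardized


def decode_if_ace(label: str) -> str:
    """Return the decoded label if it is an ACE label, else the input."""
    # Conditions 1 and 2: only ASCII code points, and the ACE prefix.
    if not label.isascii() or not label.lower().startswith(ACE_PREFIX):
        return label
    try:
        # Condition 3: the remainder must be a valid Punycode encoding.
        decoded = label[len(ACE_PREFIX):].encode("ascii").decode("punycode")
        # Condition 4: the decoding must be unchanged by nameprep, checked
        # here by re-encoding and comparing with the original label.
        round_trip = ToASCII(decoded).decode("ascii")
    except UnicodeError:
        return label
    return decoded if round_trip == label.lower() else label


for candidate in ("xn--bcher-kva", "xn--!!!", "xn--abc-", "example"):
    print(candidate, "->", decode_if_ace(candidate))
```

Only the first candidate is altered.  The others carry the prefix (or
not) but fail a later condition, so they pass through untouched.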

You don't need to invoke "binary" domain names to illustrate this
problem, since it is equally present for ASCII domain names.  In either
case, an accidental ACE label might get displayed (or even transported)
in an unintended way, but no information will be lost, and the original
form will be restored before any program attempts to interpret the
name as intended.  (Why?  Because a program that knows the intended
interpretation will be receiving it through a slot that isn't an
internationalized domain name slot, and hence the name must have been
converted back (by ToASCII) before being put into that slot.)

>  | B. Design philosophy
> 
> What value does this section provide?

It documents the rationale behind the design decisions.  I think it's
often helpful and/or instructive for people who later read the spec and
wonder "Why did they do it that way?"

I think it's generally good to have a rationale appendix in any technical
spec.  (Punycode lacks one only because I never got around to writing
one.)

> One specific item, there does not seem to be a prohibition against
> certain characters in the first position of a domain name label.

Has the current prohibition of leading & trailing hyphens been terribly
helpful?  I think it probably would have been better if the host
label syntax were simpler--just a set of allowed characters, with no
restrictions on position.  So I think we should keep any new positional
restrictions to a minimum.

But there is one class of characters that might indeed be dreadful at
the beginning: combining characters.  I recently referred to labels
that begin with combining characters as invalid Unicode strings, but
they're not, are they?  They just behave in surprising ways when abutted
with something else.  Maybe nameprep should prohibit initial combining
characters.
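
For illustration, such a check could look like the Python sketch below.
It is only a suggestion of what the prohibition might test, not part of
the nameprep draft; the function name is hypothetical, and "combining
character" is taken here to mean any code point in a Unicode mark
category (Mn, Mc, Me).

```python
import unicodedata


def starts_with_combining(label: str) -> bool:
    """True if the label's first code point is a combining mark."""
    return bool(label) and unicodedata.category(label[0]).startswith("M")


suspect = "\u0301abc"  # COMBINING ACUTE ACCENT followed by "abc"
print(starts_with_combining(suspect))
# Abutted with other text, the accent attaches to the preceding "x":
print("x" + suspect)
```

A mark in a later position, as in "e\u0301", is not flagged; only the
initial position is examined.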

> > There are some code points that are prohibited in host names,
> > but not in all textual domain names.  The underscore is the best
> > example.  These prohibitions belong in ToASCII.
>
> I disagree with this for a couple of reasons.  First of all, these
> exceptions are associated with protocol identifier labels, which are
> not hostnames.

Right, they're not host names.  That's the whole point.  There exist
domain names that are not host names.

> Nameprep defines i18n hostnames, so these characters belong in the
> nameprep prohibition.

See above.  I think nameprep applies to domain names, not just host
names.

> SRV owner labels are protocol elements, not text.  They should never
> go through any conversion.  A dedicated profile for this would help.

Consider the following UTF-8 zone file:

$ORIGIN THAI1.net.
_ldap._tcp CNAME _ldap._tcp.THAI2.net

It would be ludicrous to expect the preprocessor that converts this to
a STD-13 zone file to do a lookup on _ldap._tcp.THAI2.net (which is
not defined in this zone file) and discover that it has a SRV record
and hence avoid applying conversions to _ldap._tcp.  It should be able
to simply apply ToASCII to all domain labels.  As IDNA is currently
specified, it can.
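
Here is a minimal Python sketch of such a preprocessor step.  It uses
ToASCII from Python's standard encodings.idna module, which does not
apply the host-name character restrictions; the function name and the
label "bücher" (standing in for THAI2) are illustrative only.

```python
from encodings.idna import ToASCII


def name_to_ascii(name: str) -> str:
    """Apply ToASCII to every label of a dotted domain name."""
    return ".".join(
        # An empty label (from a trailing dot) is passed through as-is.
        ToASCII(label).decode("ascii") if label else ""
        for label in name.split(".")
    )


print(name_to_ascii("_ldap._tcp.bücher.net"))
```

This prints "_ldap._tcp.xn--bcher-kva.net".  The labels _ldap and _tcp
come through unchanged, so the preprocessor needs no knowledge of which
names carry SRV records.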

> The point here is that EVERY network service has its own
> considerations.  Demanding that all services implement a user-level
> feature is absurd.

We recommend, we don't demand.  And we don't try to apply the user-level
feature in absurd places, we say "before a domain name is displayed
to a user or is output into a context likely to be viewed by users".
Maybe you think that description is too broad.  Suggest some alternative
wording.  But we need to give some sort of encouragement for ToUnicode
to be applied in some situations.

> The outcome of this mandate is that every existing specification is
> made obsolete and non-compliant.

I don't know what it means for a specification to comply or not comply
with IDNA.  Applications can comply with IDNA, or not.

Every existing application is non-compliant with IDNA.  Obviously we
cannot expect any existing application to comply with a new spec written
after the application.

Existing applications will never apply ToASCII or ToUnicode, because
they don't know about them.  IDNA-compliant applications will apply
ToASCII and ToUnicode in various situations according to rules 1-3.

I don't see how existing specifications are made obsolete.

Dan Oscarsson <Dan.Oscarsson@trab.se> wrote:

> There should also be a requirement that a protocol using UTF-8 should
> never send data using ACE so the receiver does not have to handle the
> extra work of checking for embedded ACE.

That would be out of scope.  IDNA specifies a method for applications
to use IDNs with non-IDN-aware protocols and interfaces.  IDNA has no
business telling new IDN-aware protocols what representations they
should allow or forbid.

Besides, I think it's a good idea to do the check anyway, whenever you
display a domain name, regardless of where it came from.  ACEs might
have been leaked into the system by a gateway or cut/paste operation.

AMC