Re: [idn] Re: idn-uri document



Paul Hoffman / IMC <phoffman@imc.org> wrote:

> In draft -03, it says:
>
>     For domain names containing non-ASCII characters, the Nameprep
>     specification ([Nameprep]) defines some mappings, which mainly
>     include normalization to NFKC and folding to lower case.  When
>     encoding an internationalized domain name in an URI, these
>     mappings SHOULD NOT be applied.  It should be assumed that the
>     domain name is already normalized as far as appropriate.
>
> Why the "SHOULD NOT"?

Indeed, I also see no need for that recommendation.

> An alternate wording for the last two sentences would be:
>
>     When encoding an internationalized domain name in an URI, these
>     mappings do not need to be applied if the domain name is already
>     normalized as far as appropriate.

Am I supposed to perform some test to determine whether the domain name
is already normalized as far as appropriate?  No.  I find that clause
confusing and unnecessary.

I think the only point of this paragraph is that domain names in URIs
are not necessarily Nameprepped.  That's all it needs to say.

Draft -03 says:

> For domain names containing non-ASCII characters, the legal
> domain names are those for which the ToASCII operation ([IDNA],
> [Nameprep]; using the unescaped UTF-8 values as input), with the flags
> "UseSTD3ASCIIRules" and "AllowUnassigned" set, is successful.  The
> URI resolver MUST apply any steps required as part of domain name
> resolution by [IDNA], in particular the ToASCII operation, with the
> above-mentioned flags set.

URI resolvers should indeed set AllowUnassigned, but URI resolvers
aren't the only things that use URIs.  Consider a program that creates
HTML documents.  The domain names in the URIs it writes are stored
strings, and Stringprep requires that unassigned code points be
prohibited in stored strings, so AllowUnassigned would have to be unset
in that situation.
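
The query/stored distinction can be sketched roughly as follows, in
Python.  This is only an illustration under stated assumptions: the
helper name to_ascii_label and its allow_unassigned parameter are
hypothetical, and Python's built-in "idna" codec is documented as
assuming query strings (AllowUnassigned true) with UseSTD3ASCIIRules
false, so the stored-string check is added by hand using Stringprep
table A.1.

```python
import stringprep


def to_ascii_label(label: str, allow_unassigned: bool) -> bytes:
    """Hypothetical helper: ToASCII for one label, with an explicit
    AllowUnassigned switch layered on top of Python's idna codec."""
    if not allow_unassigned:
        # Stored strings: reject code points unassigned in Unicode 3.2
        # (Stringprep table A.1) before doing anything else.
        for ch in label:
            if stringprep.in_table_a1(ch):
                raise ValueError(
                    "unassigned code point U+%04X in stored string" % ord(ch)
                )
    # The built-in codec applies Nameprep and Punycode; it behaves as a
    # query-string implementation and does not apply UseSTD3ASCIIRules.
    return label.encode("idna")


# A resolver (query) would pass allow_unassigned=True; a program that
# writes URIs into an HTML document (stored) would pass False.
stored = to_ascii_label("b\u00fccher", allow_unassigned=False)
```

U+0221 is the first entry in table A.1, so a label containing it is
rejected when allow_unassigned is False.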

Finally, I'd like to point out an important caveat:  Even if the URI
generic syntax is updated to allow non-ASCII characters (escaped) in
the host field, that doesn't mean you can actually put non-ASCII domain
names into any URI you please.  The IDNA rules still apply.  If you know
that a URI is occupying an IDN-aware slot (for example, if it appears
in a new version of HTML that refers to the IDNA spec, or if it appears
in an old-HTML document but new HTTP features are used to negotiate
IDN-awareness), then you're free to put non-ASCII domain names (escaped)
into the URI.  But otherwise IDNA section 3.1 requirement 2 applies, and
the domain name must contain only ASCII characters.  The rationale is to
prevent non-ASCII domain names from falling into the unwitting hands of
old software that will choke on them.
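
As a rough sketch of what this means for a program that generates
URIs (the function name host_for_slot and the idn_aware flag are
hypothetical, and how a program learns whether a slot is IDN-aware is
outside the sketch): in an IDN-unaware slot the host is written in its
ASCII form via ToASCII, and only in an IDN-aware slot is the escaped
UTF-8 form used.

```python
from urllib.parse import quote


def host_for_slot(host: str, idn_aware: bool) -> str:
    """Hypothetical helper: pick the host form to write into a URI."""
    if idn_aware:
        # IDN-aware slot: non-ASCII characters may appear, escaped as
        # percent-encoded UTF-8.
        return quote(host)
    # IDN-unaware slot: the name must contain only ASCII characters, so
    # apply ToASCII to each label (Python's idna codec does this per
    # label, split on dots).
    return host.encode("idna").decode("ascii")


aware = host_for_slot("b\u00fccher.example", idn_aware=True)
unaware = host_for_slot("b\u00fccher.example", idn_aware=False)
```

An all-ASCII host comes out unchanged in either case.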

It might be prudent for the idn-uri document to remind the reader of
this.

AMC