
Re: [idn] Re: idn-uri document



I wrote:

> If you know that a URI is occupying an IDN-aware slot, then you're
> free to put non-ASCII domain names (escaped) into the URI.  But
> otherwise IDNA section 3.1 requirement 2 applies, and the domain name
> must contain only ASCII characters.

I was assuming that the URI occupies a slot of some sort.  If it is not
in a slot at all (for example, it appears in a plain text message body),
then requirement 2 does not apply, and you are free to put non-ASCII
domain names in it.

I have a question about the rationale behind allowing non-ASCII domain
names in URIs.  A URI containing a non-ASCII domain name seems to be
useful only for software that is IDN-aware but IRI-unaware.  If the
software is IDN-unaware, it will choke on the URI.  If the software is
both IDN-aware and IRI-aware, then you might as well use an IRI rather
than a URI, to avoid the ugly %-escaped UTF-8.

My question is, do we really expect to see much, if any, software that
is IDN-aware and IRI-unaware?  IDNs and IRIs are being introduced at
roughly the same time, and both deal with internationalization.  Isn't
it reasonable to assume that almost all web-related software will either
support both, or support neither?  Why complicate things by introducing
a new IDN-supporting URI type if it's not useful?  URIs with ACE are
ugly, but work with old software.  IRIs with non-ASCII domain names are
pretty, but require new software.  URIs with %-escaped non-ASCII domain
names are both ugly and require new software, so what's the point?
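
To make the three spellings concrete, here is a rough Python sketch.
The host name bücher.example and the function name three_spellings are
made up for illustration, and I'm assuming Python's built-in "idna"
codec (which implements the IDNA ToASCII operation) is an acceptable
stand-in for ToASCII.

```python
from urllib.parse import quote


def three_spellings(host: str) -> dict:
    """Return the three ways of writing the same host discussed above.

    `host` is a hypothetical non-ASCII domain name; the http scheme and
    the trailing slash are only there to make complete identifiers.
    """
    return {
        # URI with ACE: ugly, but works with old software.
        "uri_ace": "http://" + host.encode("idna").decode("ascii") + "/",
        # IRI with the non-ASCII name as-is: pretty, needs new software.
        "iri": "http://" + host + "/",
        # URI with %-escaped UTF-8: ugly and needs new software.
        "uri_escaped": "http://" + quote(host, safe=".") + "/",
    }


spellings = three_spellings("bücher.example")
```

Only the first form is something IDN-unaware software can resolve.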

Here's another model to consider:

All generic URIs (URIs beginning with scheme://) continue to be
IDN-unaware, and therefore the host field must contain only ASCII
characters.  Generic IRI syntax (any IRI beginning with scheme://) is
IDN-aware, and therefore non-ASCII names are allowed in the host field
(escaped).

I don't think there will be any problem converting generic IRIs to
generic URIs, even if domain names happen to appear in other places
(the path or query-string).  In the generic syntax, the host field is
the only place a domain name can appear that has client-side semantics.
If domain names happen to appear anywhere else in the URI, they must
be interpreted by the server, right?  If non-ASCII domain names appear
in a generic IRI outside the host field, that must mean the server is
IDN-aware, and therefore those domain names don't need to be converted
to ACE if the IRI is converted to a URI.  The server will still be
IDN-aware regardless of whether it is accessed via a URI or an IRI.
The domain name in the host field is the only one that needs to be
ToASCII'd when converting a generic IRI to a URI.
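
A minimal sketch of that conversion in Python, under some assumptions:
the function name generic_iri_to_uri and the example names are mine,
Python's "idna" codec stands in for ToASCII, and IPv6 literals in the
host field are ignored.  Only the host gets ToASCII; everything else is
just %-escaped as UTF-8, including any domain names that happen to sit
in the path or query.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left alone outside the host.  "%" is included so that
# escapes already present in the IRI are not escaped a second time.
_SAFE = "/:@!$&'()*+,;=?%"


def generic_iri_to_uri(iri: str) -> str:
    """Convert a generic (scheme://host/...) IRI to a URI.

    The host field is passed through ToASCII; userinfo, path, query and
    fragment are only %-escaped, never converted to ACE.
    """
    parts = urlsplit(iri)
    if not parts.netloc:
        raise ValueError("not a generic (scheme://host) IRI")

    # Split the authority into userinfo, host and port (no IPv6 support).
    userinfo, at, hostport = parts.netloc.rpartition("@")
    host, colon, port = hostport.partition(":")

    ace_host = host.encode("idna").decode("ascii")
    netloc = quote(userinfo, safe=_SAFE) + at + ace_host + colon + port

    return urlunsplit((
        parts.scheme,
        netloc,
        quote(parts.path, safe=_SAFE),
        quote(parts.query, safe=_SAFE),
        quote(parts.fragment, safe=_SAFE),
    ))


uri = generic_iri_to_uri("http://bücher.example/straße?q=münchen.example")
```

Note that münchen.example in the query string comes out %-escaped, not
as ACE, because under this model the server has to interpret it anyway.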

For converting non-generic IRIs to URIs, if you know the scheme, you
can extract the domain names and apply ToASCII to them.  If you don't
know the scheme, then you have no way of knowing if the URI you produce
might contain non-ASCII domain names that old software will choke on.
Whether we call the URI valid or invalid doesn't change whether it
breaks old software.  Calling the URI valid is like blaming the old
software for conforming to yesterday's standard.  Calling the URI
invalid is like blaming the converter for performing a conversion
without enough knowledge to do it safely.  The latter makes more sense
to me.
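
Here is roughly what I have in mind for a converter that refuses to
guess, sketched in Python.  Everything named here is hypothetical: the
table of known schemes, the mailto handler (which assumes the domain is
whatever follows the last "@" before any "?"), and the use of Python's
"idna" codec as ToASCII.

```python
from urllib.parse import quote


def _to_ascii(domain: str) -> str:
    # Stand-in for ToASCII, using Python's built-in "idna" codec.
    return domain.encode("idna").decode("ascii")


def _convert_mailto(body: str) -> str:
    # Simplified: one address, optionally followed by ?headers.
    address, qmark, headers = body.partition("?")
    local, at, domain = address.rpartition("@")
    if not at:
        raise ValueError("mailto address has no domain part")
    return (quote(local, safe="%!$&'()*+,;=") + at + _to_ascii(domain)
            + qmark + quote(headers, safe="%=&/:@,"))


# Schemes whose domain-name positions the converter knows about.
_KNOWN_SCHEMES = {"mailto": _convert_mailto}


def non_generic_iri_to_uri(iri: str) -> str:
    """Convert a non-generic IRI to a URI, or fail if that isn't safe."""
    if iri.isascii():
        return iri  # already a URI; nothing to convert
    scheme, colon, body = iri.partition(":")
    handler = _KNOWN_SCHEMES.get(scheme.lower())
    if handler is None:
        # Unknown scheme: we can't tell which parts are domain names,
        # so the conversion is rejected rather than done unsafely.
        raise ValueError("cannot safely convert unknown scheme: " + scheme)
    return scheme + colon + handler(body)


converted = non_generic_iri_to_uri("mailto:jürgen@bücher.example")
```

The error branch is the "blame the converter" choice: an unknown scheme
with non-ASCII content is rejected instead of producing a URI that old
software might choke on.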

How many non-generic URI schemes containing domain names are there?
Suppose we require that after the introduction of IRIs all new
non-generic schemes containing domain names must either be IDN-aware
for both URIs and IRIs, or IDN-unaware for both URIs and IRIs.  Then
IRI-to-URI converters won't need to do anything special for those
schemes.  The number of schemes that converters need to know about will
never increase.

AMC