[idn] IDNA questions




Folks,

I've been trying to understand certain aspects around the handling of
ASCII vs. non-ASCII code points in IDNA, nameprep, and stringprep
but I don't yet have good arguments for all of them.

Before I start the last call I'd like to at a minimum have good arguments
for why things are done the way they are.

The ASCII world
---------------

This is how things look in the non-IDN world today.
We have two types of domain name labels, which I'll call
host name labels (HL) and non-host name labels (NHL) in this writeup.
Most labels in actual use are HLs, but e.g. SRV records are a limited
use of NHLs (SRV records are named by two labels which include a leading
U+005F, "_").

The syntax for NHL allows any byte value, i.e. 00..FF.
The syntax for HL is letter+digit+hyphen, but a label cannot
have a leading or trailing hyphen.
As I understand it, both HL and NHL are compared using case-insensitive
matching, and case-insensitive matching is not well defined outside
of 00..7F. Thus well-defined NHLs seem in practice to be limited
to 00..7F.
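
To make the HL syntax above concrete, here is a minimal sketch in Python.
The function name is_host_label is my own, and the 63-byte length limit
is the usual DNS label limit rather than something stated above.

```python
import re

# Letters, digits, and hyphens, with no hyphen at either end.
_LDH = re.compile(rb"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?")


def is_host_label(label: bytes) -> bool:
    """Return True if `label` fits the HL (letter+digit+hyphen) syntax.

    Hypothetical helper for illustration; the 63-byte cap is the general
    DNS label limit, added here as an assumption.
    """
    return len(label) <= 63 and _LDH.fullmatch(label) is not None
```

With this check, b"example" and b"a-b" are HLs, while an SRV-style label
such as b"_sip" is only an NHL.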

HLs are compared using case-insensitive matching on DNS servers, so
names still work even if some application or DNS client software
modifies their case.

If an application wants to compare two domain names it can use
a fairly efficient piece of code (e.g. strcasecmp offers such a thing
on certain Unix platforms).
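
As an illustration of how cheap that comparison is, here is a rough
Python equivalent of the strcasecmp approach. The function name is
hypothetical. It relies on bytes.lower() mapping only the ASCII letters,
so bytes in 80..FF are compared as-is.

```python
def names_equal_ascii(a: bytes, b: bytes) -> bool:
    """Compare two wire-form names, ignoring case for ASCII letters only.

    bytes.lower() converts only A..Z; every other byte value, including
    80..FF, is left untouched.
    """
    return a.lower() == b.lower()
```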

IDNA specifications
-------------------

The state of the IDNA specifications is as follows.
We have two new types of domain name labels which I'll 
call IHL and INHL in this writeup. Both IHL and INHL can be ACE encoded
with the resulting label being a HL in most cases. (An INHL with ASCII 
characters outside of LDH will get encoded as an ACE that is a NHL; see 
example (S) in punycode).

The difference between the code points allowed in IHL and INHL is
not that great. The only difference is whether ASCII code points outside
of LDH are allowed; they are allowed in INHL but not in IHL.
Both IHL and INHL prohibit the same set of code points outside of the
ASCII range (they are all specified in stringprep).

Thus e.g. non-ASCII space and control characters are not allowed in INHLs,
while non-ASCII punctuation, symbols, etc. are allowed in IHLs.

The handling of case is as follows:
If a label consists entirely of ASCII code points the case is not modified -
the label is not subject to nameprep. This prevents ToASCII from changing
the case of an all-ASCII label.
If the label consists of some ASCII code points and some non-ASCII,
then all code points are case folded - including the ASCII code points.
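
The case handling just described can be observed with Python's
encodings.idna module. This is only an illustration: it assumes that
module (which implements the later published form of IDNA) behaves the
way the drafts discussed here describe.

```python
from encodings.idna import ToASCII

# All-ASCII label: nameprep is skipped, so the case is left alone.
ascii_result = ToASCII("Example")

# Mixed label ("B\u00fccher" is "Bucher" with u-umlaut): nameprep case
# folds every code point, including the ASCII "B", before ACE encoding.
mixed_result = ToASCII("B\u00fccher")

print(ascii_result)  # b'Example'
print(mixed_result)  # b'xn--bcher-kva'
```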

The IDNA specification says
    This document does not attempt to define an "internationalized host
    name".  It is expected that protocols and name-handling bodies
    will want to limit the characters allowed in IDNs further than
    what is specified in this document, such as to prohibit additional
    characters that they feel are unneeded or harmful in registered
    domain names.

The "protocol" part could be interpreted to mean (at least) 
two slightly different things:
 - That a future IETF standard will define internationalized host names
   and in the process exclude some code points from the set allowed in 
   IHL.
 - That a future application protocol might decide that certain code points
   should not be allowed to specify a host name as used by that application
   protocol.

If an application wants to compare two domain names it needs to perform
ToASCII on those strings and then compare the results using case-insensitive
comparison. (See section 3 in the IDNA spec.)
If the application is going to do repeated comparisons (e.g. against
a list of domain names), then to perform well it is likely to end up
storing the result of ToASCII for the domain names it compares against.
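
A minimal sketch of that comparison, again assuming Python's
encodings.idna as a stand-in for the spec. The helper names are mine,
and the sketch simplifies: it splits only on "." and does not handle a
trailing dot.

```python
from encodings.idna import ToASCII


def to_ace(name: str) -> bytes:
    """Apply ToASCII to each label, then lower-case the ASCII result.

    Simplification: labels are split on "." only, and an empty label
    (e.g. from a trailing dot) is rejected by ToASCII.
    """
    ace = b".".join(ToASCII(label) for label in name.split("."))
    return ace.lower()


def names_equal(a: str, b: str) -> bool:
    """ToASCII both names, then compare case-insensitively."""
    return to_ace(a) == to_ace(b)


# For repeated comparisons, store the ToASCII results once.
known = {to_ace(n) for n in ["Example.COM", "B\u00fccher.example"]}


def is_known(name: str) -> bool:
    return to_ace(name) in known
```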

My questions
------------

1. Allowed characters in INHL.

For some code points (like unassigned, private use, non-character, and 
surrogate code points) it seems to make sense to prohibit them in both 
IHL and INHL. In fact, unassigned code points MUST be prohibited in 
both for proper handling as they get assigned. [The fact that this only 
applies to stored strings is critical in general, but not important in 
this discussion.]

But why not take the same approach as for ASCII and allow space characters,
control characters, etc in INHL?
It might not matter much since protocols will (mostly) use IHLs but it
would make things follow the same philosophy as in the ASCII case.

The argument for doing things the way they are seems to be that
 - The goal is to enable the richer set of code points to be used by 
   applications
 - The existing uses of non-LDH domain names in ASCII are SRV and the data
   part of the SOA record. Those records work fine in this approach.
 - While one could worry more about non-host name syntax there aren't
   clear arguments for either approach (allowing as many code points
   as possible vs. preventing the ones that obviously don't make sense
   in names), so just pick the simplest approach.
   The current approach is simple in that nameprep/stringprep is the
   same for host names and non-host names.

Are there other arguments for the current approach?


2. Initially allowed characters in IHL.

The above quote from the IDNA specification, with the possible interpretations
of "protocol", doesn't provide much guidance to registries on what code points
can be safely registered, i.e. do not risk being
declared "not a host name" in some later IETF standard.

I think there can be two different intents here:
1. The intent is that all restrictions be solely by registries ("name-
   handling bodies") based on their local policies of what to register.
   No application protocols will restrict the set of allowed code points
   beyond what is currently specified in IDNA+nameprep+stringprep.
2. The intent is that at a later date the IETF will produce a specification
   for host name syntax (either for specific application protocols
   or in general) that will place some additional restrictions.

If #1 is the case then as far as I can tell "protocol" should be removed
from the sentence.

If #2 is the case then I think the document needs to provide some
guidance on what set of code points users and registries should
use in order to minimize the risk that their domain names be
declared invalid host names at a future date.

Which is the intent here?

If #2 then why can't the document explicitly state that the intent is 
that no IETF standard restrict what can be used as an internationalized 
host name by prohibiting <list of code points>?
Presumably <list of code points> could be a reference to a set of
general categories in the Unicode standard.
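
To show what "a reference to a set of general categories" could look
like in practice, here is a sketch using Python's unicodedata module.
The category set below is purely hypothetical and chosen only for
illustration; it is not a proposal for the actual list.

```python
import unicodedata

# HYPOTHETICAL set of Unicode general categories, for illustration only:
# lower-case/other/modifier letters, combining marks, decimal digits.
ILLUSTRATIVE_CATEGORIES = {"Ll", "Lo", "Lm", "Mn", "Mc", "Nd"}


def in_illustrative_host_set(label: str) -> bool:
    """Check an already-nameprepped label against the illustrative set.

    The hyphen is allowed explicitly, as in the LDH rule.
    """
    return all(
        ch == "-" or unicodedata.category(ch) in ILLUSTRATIVE_CATEGORIES
        for ch in label
    )
```

Under this illustrative set, a label containing U+2603 (a symbol,
category So) would fall outside the guaranteed set, while "b\u00fccher"
would fall inside it.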

This would not prevent a future IETF standard from declaring that a larger
set of code points can be used in internationalized host names.
And it would not prevent a particular registry from having a policy that
only allows a smaller set of code points.


3. Downcasing of ASCII

Avoiding downcasing of ASCII code points by ToASCII in an all-ASCII label
seems a bit odd. I understand that this makes an IDN-aware application
generate the same DNS queries on the wire as the application before it
was made IDN-aware (i.e. if the application looked up "Example.COM"
it would appear identically in the DNS packet after the application
was made IDN-aware). But is this important?

The reasons I've seen seem to be that some applications would take
the result of ToASCII and use it in protocol fields such as
the From: in SMTP. The effect of having ToASCII downcase an all-ASCII
label is that users will see different case characters when they
upgrade to IDNA-aware applications even though they do not use
an IDN themselves. Thus there might be some noticeable short-term effect
depending on exactly how the IDNA-aware application is coded.

The reason I'm asking is that when application protocols start carrying
IDNs in a non-ACE encoding (people seem to think that this will
happen sooner or later - perhaps for brand new application protocols and
perhaps by making some existing application protocols negotiate the
encoding) there will still be a need to nameprep those domain names in
order for comparisons to work. However, the ACE encode/decode isn't needed
except to deal with peers that do not handle the new on-the-wire encoding.
[NOTE: I'm not talking about the DNS as a potential application protocol
here. But perhaps SMTP could be handled this way.]

This can be handled simply when that day comes by splitting
ToASCII into two parts: approximately steps 1-3 to prepare labels so that
comparisons work, and approximately steps 4-7 for ACE conversion.

When that day comes the comparisons of the prepared labels still need
to use case-insensitive matching for 00..7F code points just because
the preparation didn't downcase the ASCII code points. But the non-ASCII code
points don't need this, i.e. applying all of the Unicode Fold() isn't
necessary.
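
A rough sketch of the "prepare" half of that split and the comparison it
forces, assuming Python's encodings.idna.nameprep as the nameprep
implementation. The names prepare, ascii_fold, and prepared_equal are
hypothetical, and the mapping to step numbers is only approximate.

```python
from encodings.idna import nameprep

# Translation table that lower-cases A..Z and nothing else.
_ASCII_LOWER = {c: c + 32 for c in range(ord("A"), ord("Z") + 1)}


def prepare(label: str) -> str:
    """Roughly the first part of ToASCII: nameprep only if non-ASCII."""
    if label.isascii():
        return label  # all-ASCII labels keep their case
    return nameprep(label)


def ascii_fold(label: str) -> str:
    """Fold only the 00..7F letters; non-ASCII was folded by nameprep."""
    return label.translate(_ASCII_LOWER)


def prepared_equal(a: str, b: str) -> bool:
    """Compare two labels without ACE conversion."""
    return ascii_fold(prepare(a)) == ascii_fold(prepare(b))
```

Here prepare("Example") and prepare("example") differ, which is why the
extra ASCII-only fold is still needed at comparison time.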

Thus the current approach seems a bit short-sighted unless there
is a good reason that I don't understand for not case folding ASCII-only
labels.

4. Comparison

Section 3 item 3 seems to implicitly prevent Unicode-aware applications
from comparing two nameprepped strings without first converting them to ACE.
Why can't it be made explicit that, as a result of the design of the ACE
algorithm being 1-1, two Unicode labels compare as equal if and only if
the corresponding ACE labels compare as equal?

This would mean that an application, as long as it has applied
ToUnicode to the strings when they are received from an input method, the
protocol, or some other source, could do a comparison without
ACE conversion.
And if case folding were applied to ASCII-only labels just like
all other labels, then the resulting comparison would just need
to be a code point by code point comparison.
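
As a sketch of what that would look like, the following applies nameprep
to every label, including all-ASCII ones, which is the variant being
asked about and not what the current spec does. It again assumes
Python's encodings.idna; prepare_all and equivalent are hypothetical
names.

```python
from encodings.idna import ToASCII, nameprep


def prepare_all(label: str) -> str:
    """Variant under discussion: nameprep every label, ASCII or not."""
    return nameprep(label)


def equivalent(a: str, b: str) -> bool:
    """Plain code point by code point comparison of prepared labels."""
    return prepare_all(a) == prepare_all(b)


# Spot check of the "if and only if" claim on a few sample labels:
# prepared labels are equal exactly when their ACE forms are equal.
samples = ["Example", "EXAMPLE", "B\u00fccher", "b\u00dcCHER", "other"]
for x in samples:
    for y in samples:
        ace_equal = ToASCII(prepare_all(x)) == ToASCII(prepare_all(y))
        assert equivalent(x, y) == ace_equal
```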

---

  Erik