
[idn] IRIs ought to use internationalized *host* names



James Seng/Personal <jseng@pobox.org.sg> wrote:

> The discussion of the how URL is to be encoded and how Host: field are
> to be handled is probably more relevant so lets get back to that.

Okay.  Eventually this message will arrive at the following proposal:

    Proposed repertoire for internationalized *host* labels:  All
    characters in classes L (letter), M (mark), and N (number) are
    allowed, and U+002D (hyphen-minus) is also allowed.  Everything else
    is forbidden.

The IRI proposal (draft-masinter-url-i18n-08) calls for the host labels
to be ASCII LDH only, just like in URIs.

When converting an IRI to a URI, you have to convert the path components
from the local charset to Unicode, then do Unicode normalization, UTF-8
encoding, and %-escaping.  But you don't do anything to the host labels
because they're already LDH.
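
As a rough illustration of the path-component steps just described, here
is a minimal Python sketch.  The function name and the choice of NFC as
the normalization form are assumptions made for the example, not text
taken from the IRI draft.

```python
import unicodedata
from urllib.parse import quote


def iri_path_to_uri_path(raw: bytes, local_charset: str) -> str:
    """Convert a path given in a local charset to a %-escaped URI path.

    Hypothetical helper; NFC is assumed as the normalization form.
    """
    # Step 1: local charset -> Unicode.
    text = raw.decode(local_charset)
    # Step 2: Unicode normalization (NFC assumed here).
    normalized = unicodedata.normalize("NFC", text)
    # Steps 3 and 4: UTF-8 encode, then %-escape the non-ASCII octets.
    # "/" is kept literal so the path structure survives.
    return quote(normalized, safe="/", encoding="utf-8")
```

The host part of the IRI is deliberately not touched by this helper,
which mirrors the draft's assumption that host labels are already LDH.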

I suspect that the reason the IRI proponents don't internationalize the
host field is that they don't yet have an official IDN spec to point at.
When they do, I suspect they'll want to revise their proposal so that
the host field can use the local charset.

This raises the question of what characters should be allowed in host
labels.  Since URIs do not allow arbitrary ASCII labels, only host
labels restricted to LDH characters, one would expect, analogously,
that IRIs would not allow arbitrary IDNs containing the exotic symbols
and punctuation allowed by Nameprep, but would allow only host labels
restricted to a selected set of characters.

Which characters should be allowed in internationalized host labels?
This is an interesting question in its own right, and it's possible that
the IESG will demand an answer.

The Unicode character database classifies each character as belonging to
exactly one of the following broad classes:

L: letter
M: mark
N: number
P: punctuation
S: symbol
Z: separator
C: other

We can start by examining, class by class, which ASCII characters are
allowed in ASCII host labels.

L: 52 exist, all are allowed
M:  0 exist
N: 10 exist, all are allowed
P: 23 exist, only hyphen-minus is allowed
S:  9 exist, none are allowed
Z:  1 exists, it is not allowed
C: 33 exist, none are allowed
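
The tally above can be reproduced with a short Python sketch, assuming
the general categories reported by Python's unicodedata module (whose
Unicode version depends on the interpreter) and taking LDH to mean the
ASCII letters, digits, and hyphen-minus.  The function name is made up
for the example.

```python
import string
import unicodedata
from collections import Counter

# LDH: ASCII letters, digits, and hyphen-minus.
LDH = set(string.ascii_letters + string.digits + "-")


def ascii_class_tally():
    """Count ASCII characters per broad class, and how many are LDH."""
    exist = Counter()
    allowed = Counter()
    for cp in range(0x80):
        ch = chr(cp)
        # The first letter of the general category is the broad class.
        broad = unicodedata.category(ch)[0]
        exist[broad] += 1
        if ch in LDH:
            allowed[broad] += 1
    return dict(exist), dict(allowed)


if __name__ == "__main__":
    exist, allowed = ascii_class_tally()
    for broad in "LMNPSZC":
        print(broad, exist.get(broad, 0), allowed.get(broad, 0))
```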

We can trivially extend these results to form a simple rule covering the
entire Unicode repertoire, except that we have no precedent for class
M.  Since characters in class M tend to be things like diacritics, they
should be allowed.  So the proposed rule is:

All characters in classes L (letter), M (mark), and N (number) are
allowed, and U+002D (hyphen-minus) is also allowed.  Everything else is
forbidden.
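
A minimal sketch of the proposed rule as a predicate, in Python.  The
function names are hypothetical, and the result depends on the Unicode
version behind the interpreter's unicodedata module, so this is an
illustration rather than a normative enumeration.

```python
import unicodedata


def is_host_code_point(ch: str) -> bool:
    """True for U+002D and for code points in classes L, M, or N."""
    return ch == "-" or unicodedata.category(ch)[0] in ("L", "M", "N")


def is_host_label(label: str) -> bool:
    """True if the label is non-empty and uses only host code points.

    Only the repertoire is checked; rules about leading or trailing
    hyphens and label length are outside this sketch.
    """
    return len(label) > 0 and all(is_host_code_point(ch) for ch in label)
```

For example, is_host_label("bücher") holds, while "a_b" is rejected
because the underscore is in class P and is not hyphen-minus.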

Notice that there is no conflict with Nameprep, because Nameprep does
not prohibit any characters in classes L, M, or N.

If we were to adopt this definition of internationalized host name, it
would best be understood as an amendment of ToASCII step 3 (which checks
host name restrictions if applicable), tightening substep 3a from:

         (a) Verify the absence of non-LDH ASCII code points; that is, 
             the absence of 0..2C, 2E..2F, 3A..40, 5B..60, and 7B..7F.

to:

         (a) Verify that the sequence contains only host code points;
             that is, U+002D (hyphen-minus) and code points classified
             as L (letter), M (mark), or N (number).  See appendix ? for
             an enumeration of host code points.

Or maybe the enumeration would go in Nameprep, or in a separate document
that defines internationalized host names.

Getting back to IRIs and URIs:  I propose that conversion of an IRI
to URI involve applying ToASCII to each host label.  This would allow
conversion of any IRI to a URI without changing the syntax of URIs.  In
contrast, the method proposed in draft-ietf-idn-uri-01 would change the
URI syntax.
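
To make the host part of that conversion concrete, here is a small
Python sketch that applies ToASCII to each label.  It leans on Python's
built-in "idna" codec as a stand-in for ToASCII; that codec follows the
IDNA specification as later published, which may differ from the drafts
under discussion here, and the function name is hypothetical.  The
sketch does not itself perform the host code point check proposed above.

```python
def host_to_ascii(host: str) -> str:
    """Apply ToASCII to each dot-separated label of a host name."""
    ascii_labels = []
    for label in host.split("."):
        # The "idna" codec stands in for ToASCII on a single label;
        # labels that are already ASCII pass through unchanged.
        ascii_labels.append(label.encode("idna").decode("ascii"))
    return ".".join(ascii_labels)


if __name__ == "__main__":
    print(host_to_ascii("www.bücher.example"))
```

Because the output contains only LDH labels, the resulting host fits the
existing URI syntax without modification.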

AMC