
Re: [idn] IRIs ought to use internationalized *host* names



> Okay.  Eventually this message will arrive at the following proposal:
>
>     Proposed repertoire for internationalized *host* labels:  All
>     characters in classes L (letter), M (mark), and N (number) are
>     allowed, and U+002D (hyphen-minus) is also allowed.  Everything else
>     is forbidden.

Hmm, the last discussion on this topic of host names was summarized by
Eric Hall - http://www.imc.org/idn/mail-archive/msg05182.html

If we want to go near this slippery slope, let's start from there.
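For reference, the repertoire proposed in the quoted text (classes L, M,
and N, plus U+002D) could be checked roughly as follows. This is only a
sketch: the function name is made up for illustration, and it tests
nothing beyond the general category of each character.

```python
import unicodedata

def in_proposed_repertoire(label: str) -> bool:
    """Hypothetical helper: True if every character of the label is a
    letter (L*), mark (M*), number (N*), or U+002D HYPHEN-MINUS."""
    return len(label) > 0 and all(
        ch == "\u002d" or unicodedata.category(ch)[0] in "LMN"
        for ch in label
    )
```

Note that this says nothing about what a given zone would actually
accept; it only mirrors the character classes named in the proposal.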

> When converting an IRI to a URI, you have to convert the path components
> from the local charset to Unicode, then do Unicode normalization, UTF-8
> encoding, and %-escaping.  But you don't do anything to the host labels
> because they're already LDH.

Correct.

But I think an IRI could be NFC-UTF-8, %-escaped, even for host labels.
This is consistent with the CharMod of HTML. When an IRI is used and the
application has to look up the host name, it can apply Nameprep-Punycode
before it hits the stub resolver.

To be more precise, let's look at how a browser can implement IRI with IDNA.

- On the IRL bar -> IRLs are likely to be in the font encoding. Which font
  encoding that is depends on the OS's capability to render the font,
  ranging from the locale encoding to UTF-8. (This is why some browsers
  behave confusingly.)

- Internally, the applications would process IRL in NFC-%-escape-UTF-8.

- For DNS resolution, the host name would be extracted from the IRL in
  NFC-%-escape-UTF-8 and converted to Nameprep-Punycode before calling
  gethostbyname().
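The last step above can be sketched like this. It is an illustration
only, not how any particular browser does it: the function name is
hypothetical, and Python's built-in "idna" codec stands in for the
Nameprep-Punycode step.

```python
import unicodedata
from urllib.parse import unquote

def host_for_resolver(escaped_host: str) -> str:
    """Hypothetical helper: turn a host name held internally as
    NFC-%-escape-UTF-8 into the ASCII form handed to the resolver."""
    # Undo the %-escaping; the escaped bytes are assumed to be UTF-8.
    unicode_host = unquote(escaped_host, encoding="utf-8", errors="strict")
    # Re-apply NFC in case the input was not normalized.
    unicode_host = unicodedata.normalize("NFC", unicode_host)
    # The "idna" codec applies Nameprep and Punycode to each label.
    return unicode_host.encode("idna").decode("ascii")
```

For example, host_for_resolver("b%C3%BCcher.example") gives
"xn--bcher-kva.example", which is what would then be passed to
gethostbyname(). A host name that is already LDH passes through
unchanged.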

The two issues I can see are:

1. For proxy lookup, does the browser send the IRI in NFC-%-escape-UTF-8?
   Or should it send it as NFC-%-escape-UTF-8 except for the host name,
   which would be in Nameprep-Punycode? Or just UTF-8?

2. For the "Host:" field in HTTP negotiation, does it send "Host:" in
   Nameprep-Punycode or does it send it in NFC-%-escape-UTF-8? Or just
   UTF-8?

I believe that, for backward compatibility, the browser should send both
in Nameprep-Punycode and not in NFC-%-escape-UTF-8.

Am I going off on a tangent here?

> Which characters should be allowed in internationalized host labels?
> This is an interesting question in its own right, and it's possible that
> the IESG will demand an answer.

My belief is that what is allowed in host labels is a topic for the zone
administrator to decide. .CN has a different set compared to .SG compared
to .COM compared to, say, IBM.COM. Each zone administrator has to decide
what to allow and what not to allow.

> Getting back to IRIs and URIs:  I propose that conversion of an IRI
> to URI involve applying ToASCII to each host label.  This would allow
> conversion of any IRI to a URI without changing the syntax of URIs.  In
> contrast, the method proposed in draft-ietf-idn-uri-01 would change the
> URI syntax.

I like this idea.

If this is done (i.e. the IRI goes through ToASCII for each host label),
then we are basically saying that for proxy and Host:, you should start
from the URI, not the IRI.
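A rough sketch of that IRI-to-URI conversion, under some assumptions: the
function name is invented, Python's "idna" codec is used as a stand-in
for ToASCII on each host label, and userinfo and non-ASCII fragments are
ignored to keep it short.

```python
import unicodedata
from urllib.parse import quote, urlsplit, urlunsplit

def iri_to_uri(iri: str) -> str:
    """Hypothetical helper: ToASCII for the host labels, NFC + UTF-8 +
    %-escaping for the path and query."""
    parts = urlsplit(iri)
    # Host: the "idna" codec applies Nameprep and Punycode per label.
    netloc = (parts.hostname or "").encode("idna").decode("ascii")
    if parts.port is not None:
        netloc += f":{parts.port}"
    # Path and query: normalize, then %-escape the UTF-8 bytes.
    # "%" is kept safe so existing escapes are not escaped twice.
    path = quote(unicodedata.normalize("NFC", parts.path), safe="/%")
    query = quote(unicodedata.normalize("NFC", parts.query), safe="=&%")
    return urlunsplit((parts.scheme, netloc, path, query, parts.fragment))
```

With this, iri_to_uri("http://bücher.example/straße") yields
"http://xn--bcher-kva.example/stra%C3%9Fe". The result is plain URI
syntax, so the proxy request and the Host: field can both be taken from
it directly.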

-James Seng