
Re: [idn] Re: Is space allowed in a hostname?



At 08:25 02/07/10 +0200, Dan Oscarsson wrote:

> I do not know why stringprep only has NFKC or unnormalised as possible
> choices. NFC is a very suitable choice to use for UCS.
> It is the choice of W3C

Yes, the most important point here being that W3C deals with all
kinds of text rather than just identifiers.


> and is the required choice in IRI/URIs.

Please read the newest draft, at
http://www.ietf.org/internet-drafts/draft-duerst-iri-01.txt
NFC is indeed the default choice. Something like this is needed
because otherwise, you don't know what UCS codepoints you get for
e.g. an 'a' with two dots above. But NFC is not applied everywhere;
if you get an IRI already encoded in Unicode, it's not normalized
again, and there is even the option that a user enters something
unnormalized on purpose.
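
To make the 'a' with two dots above concrete, here is a minimal sketch.
It uses Python's standard unicodedata module purely as an illustration;
nothing in the draft prescribes this code, and the variable names are
made up for the example.

```python
import unicodedata

# 'a' followed by COMBINING DIAERESIS (U+0308): two code points.
decomposed = "a\u0308"
# LATIN SMALL LETTER A WITH DIAERESIS (U+00E4): one code point.
precomposed = "\u00e4"

# Both render the same, yet compare unequal as raw code point sequences.
raw_equal = decomposed == precomposed

# After NFC, both inputs collapse to the single precomposed code point.
nfc_equal = (unicodedata.normalize("NFC", decomposed)
             == unicodedata.normalize("NFC", precomposed))

print(raw_equal, nfc_equal)  # False True
```

Without a default normalization, two parties can hold what looks like
the same IRI and still fail to match it.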

The main reason for this is that URIs/IRIs are 'greatest common
denominators' for a lot of other identifiers. It's rather clear
that we don't need a circled 'A' in IDN. But we don't want to
eliminate circled 'A's from all other identifiers. For non-normalized
text, we don't want to disallow the following:

http://example.com/normalize.cgi?input=<something-non-normalized>
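
The circled 'A' case can be sketched the same way, again with Python's
standard unicodedata module as an assumed stand-in for whatever
normalizer is actually used: NFC leaves the character alone, while NFKC
folds it to a plain 'A'.

```python
import unicodedata

circled_a = "\u24b6"  # CIRCLED LATIN CAPITAL LETTER A

# NFC only recombines canonical equivalents, so the circled 'A' survives.
nfc = unicodedata.normalize("NFC", circled_a)

# NFKC also applies compatibility mappings, so the circle is dropped.
nfkc = unicodedata.normalize("NFKC", circled_a)

print(nfc == circled_a, nfkc)  # True A
```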

Regards,     Martin.


> NFC has the nice properties that it preserves all information and is
> compact. It is a very good choice to use for interoperability.
> Unnormalised is only useful locally on a system, never for
> interoperability.
> NFKC is a possible choice when matching text but may go too far when
> applied to identifiers (names).
>
>    Dan