Re: [idn] Unicode tagging



I don't know enough about the internals of DNS to comment on the
usage model, but I do have a few remarks that may be relevant.

1. The *whole* canonicalization process, as outlined in "[idn] UTC
Feedback", destroys information. That is, case folding (or folding
dashes) also destroys information: there is no way to recover the
original case information once folded. Filtering also "destroys"
information in a sense, by disallowing certain characters. Disallowing
spaces, for example, very much alters the allowable text.

2. The canonicalization process is *not* designed to be applied to
arbitrary text. It is designed to be applied to identifiers, or
similarly constrained environments where not all characters are allowed.
Superscript 2 is not a problem, because it would be filtered out before
it is ever normalized (see earlier messages about NFKC with
identifiers).

3. For arbitrary text -- not identifiers -- you are absolutely right
that NFC is the correct normalization to use.
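
To make the three points concrete, here is a minimal sketch in Python
using the standard unicodedata module. The helper names (is_allowed,
canonicalize) and the filter rule (letters, marks, decimal digits, and
hyphen only) are assumptions made for illustration. They are not the
character repertoire or the exact folding steps under discussion on
this list.

```python
import unicodedata


def is_allowed(ch):
    # Hypothetical identifier filter: letters, combining marks,
    # decimal digits, and the hyphen. Everything else is rejected.
    cat = unicodedata.category(ch)
    return ch == "-" or cat[0] in ("L", "M") or cat == "Nd"


def canonicalize(label):
    # Filter first, so characters such as superscript two never
    # reach the normalization step (point 2).
    for ch in label:
        if not is_allowed(ch):
            raise ValueError("disallowed character: %r" % ch)
    # Then apply NFKC and case folding. Both steps are lossy (point 1).
    return unicodedata.normalize("NFKC", label).casefold()


# Point 1: the original case cannot be recovered after folding.
folded_a = canonicalize("Example")
folded_b = canonicalize("EXAMPLE")

# Point 3: for arbitrary text, NFC keeps superscript two intact,
# while NFKC would replace it with an ordinary digit.
text = "x\u00b2"
as_nfc = unicodedata.normalize("NFC", text)
as_nfkc = unicodedata.normalize("NFKC", text)
```

In this sketch canonicalize("x\u00b2") raises ValueError, because
superscript two (general category No) fails the filter before any
normalization takes place.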

Mark

Dan Oscarsson wrote:
> 
> >FWIW, I favor Normalization Form KC.  It makes the most sense to me to
> >normalize and canonicalize (with whatever spec is decided upon) at the
> >point of name entry, with possibilities for redundancy where folks think
> >it is prudent.
> 
> We must also remember that DNS does contain text that is not domain names.
> That text should not be normalised using form KC, here form C must be used.
> So DNS clients need to handle two forms of normalisation if KC is used
> for domain names (and they need to know which label is a host/domain name).
> Or shall we define that matching of labels shall always be matched using
> form KC? Even when a label is used for other things than host name?
> While we can require the query name to be of form KC and that matching
> of labels be done using form KC, the labels in DNS may not be of form KC
> as it destroys vital information (like replacing superscript 2 with normal 2,
> though replacing Greek A with ASCII A is ok).
> 
>     Dan