
Re: Need for Normalization forms "KR" was: Re: [idn] case folding



I agree with Ken that the current list of compatibility characters was not necessarily designed with identifiers or DNS names in mind, and that there are some definite oddities. But just as some people are against introducing new UTF forms (unless the current list is shown to be clearly inadequate), I am strongly against adding new normalization formats (unless the current list is shown to be clearly inadequate).

In particular, Paul's formulation seems exactly right:

a. Input from client software ->
b. Check for prohibited #1 ->
c. Case fold ->
d. Canonicalize ->
e. Check for prohibited #2 ->
f. Put on wire
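
For concreteness, steps (b) through (e) can be sketched in a few lines of Python. This is only an illustration under assumptions: `PROHIBITED` is a placeholder set (the real lists for (b) and (e) are not settled in this thread), Python's `str.casefold()` stands in for the UTR #21 case folding, and NFKC is used for (d) as proposed further down in this message. The names `check_prohibited` and `prepare` are made up for the sketch.

```python
import unicodedata

# Assumption: a placeholder prohibited set. The actual lists for steps (b)
# and (e) are not defined in this message.
PROHIBITED = {" ", "\u00a0"}


def check_prohibited(label: str) -> str:
    """Raise ValueError if the label contains a prohibited character."""
    for ch in label:
        if ch in PROHIBITED:
            raise ValueError(f"prohibited character U+{ord(ch):04X}")
    return label


def prepare(label: str) -> str:
    """Run steps (b) to (e); the result is what step (f) would put on the wire."""
    label = check_prohibited(label)               # (b) check for prohibited #1
    label = label.casefold()                      # (c) case fold (stand-in for UTR #21)
    label = unicodedata.normalize("NFKC", label)  # (d) canonicalize
    return check_prohibited(label)                # (e) check for prohibited #2
```

With this sketch, `prepare("Example")` gives `"example"`, and a fullwidth `"\uff21"` comes out as plain `"a"`. The second check in (e) matters because (c) and (d) can produce characters that were not present in the input.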

The key then becomes: are there characters that are not prohibited by (b) or (e) that really cause problems?

Assume for now that (b) starts from the standard identifier syntax:

          <identifier> ::= <identifier_start> ( <identifier_start> | <identifier_extend> )*
          <identifier_start> ::= [{Lu}{Ll}{Lt}{Lm}{Lo}{Nl}]
          <identifier_extend> ::= [{Mn}{Mc}{Nd}{Pc}{Cf}]
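
A minimal sketch of that grammar as a check, using the general categories from Python's `unicodedata` module. The names `START`, `EXTEND` and `is_identifier` are invented for the example, and the results depend on the Unicode version shipped with the interpreter.

```python
import unicodedata

# General categories taken from the grammar above.
START = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
EXTEND = {"Mn", "Mc", "Nd", "Pc", "Cf"}


def is_identifier(s: str) -> bool:
    """True if s matches <identifier_start> (<identifier_start> | <identifier_extend>)*."""
    if not s:
        return False
    if unicodedata.category(s[0]) not in START:
        return False
    # Every later character may be either a start or an extend character.
    return all(unicodedata.category(c) in START | EXTEND for c in s[1:])
```

For example, `is_identifier("abc1")` is true while `is_identifier("1abc")` is false. Note that `"a-b"` is also rejected, since HYPHEN-MINUS is category Pd, so a DNS profile of (b) would have to adjust the syntax rather than use it unchanged.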

Asmus's problem list included:

     <font>
     <super>
     <sub>
     <circle>
     <compat>

     The <compat> sub-type, being the 'grab-bag' of characters
     with compatibility relations that are not further
     specified, and in some cases even questionable (2107) would need to be
     analyzed once, in case-by-case approach. Some examples:

     Roman Numerals: KR
     Parenthesized: KR
     CJK and Radicals compats: KR

     Dotted Alphanumerics: probably KR
     Ligatures: probably KR
     Telegraph symbols: probably KR

     Euler Constant: not-KR
     Alef Symbol, etc.: not-KR

     Spacing accents (mapped to SP + combining accents): ??

Many of these are already eliminated by (b). You don't care that the fraction "1/2" decomposes in NFKC because you don't allow it in the first place; the same with many other characters. The spacing accents are also eliminated in (b).
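
To make that concrete, here is a small survey of a few characters from the list above: their general category, whether the identifier categories would admit them, and what NFKC does to them. The sample choice and the helper name `survey` are mine, purely for illustration.

```python
import unicodedata

# Categories admitted by the identifier syntax (start and extend combined).
ALLOWED = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl", "Mn", "Mc", "Nd", "Pc", "Cf"}


def survey(ch: str):
    """Return (general category, admitted by the syntax?, NFKC form)."""
    cat = unicodedata.category(ch)
    return cat, cat in ALLOWED, unicodedata.normalize("NFKC", ch)


SAMPLES = {
    "VULGAR FRACTION ONE HALF": "\u00bd",
    "ACUTE ACCENT (spacing)": "\u00b4",
    "ROMAN NUMERAL FOUR": "\u2163",
    "EULER CONSTANT": "\u2107",
}

for name, ch in SAMPLES.items():
    cat, ok, nfkc = survey(ch)
    print(f"U+{ord(ch):04X} {name}: {cat}, allowed={ok}, NFKC={nfkc!r}")
```

The fraction (No) and the spacing accent (Sk) never get past (b). The Roman numeral (Nl) and the Euler constant (Lu) do get past the unmodified syntax, and NFKC then maps them to "IV" and U+0190 respectively, which is why they are the ones worth discussing.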

Of the remaining ones (e.g. Euler Constant), I strongly suspect that the world can live without them being in DNS names. Thus they can also be filtered in step (b), by restricting the syntax a bit more.

So I will rephrase my question ("I'd like to see examples of some cases that the contras believe are problems"), more carefully this time (though perhaps still not carefully enough for Ken; we will see).

Given
    1. Paul's formulation of the process (with my lettering above)
    2. NFKC used for (d)
    3. UTR #21 case folding used for (c)
    4. Remaining Asmus cases (after further review) filtered by (b)

Are there characters that would not survive into (f), that MUST be in DNS names?
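
One way to draw up the list of candidates to argue about is to enumerate the characters that the identifier categories admit but that the compatibility part of NFKC alters after case folding. The sketch below does that for the BMP only. It assumes Python's `str.casefold()` as a stand-in for UTR #21 folding and uses whatever Unicode version the interpreter ships with; `altered_by_compat` and `candidates` are hypothetical names.

```python
import unicodedata

# Categories admitted by the identifier syntax (start and extend combined).
ALLOWED = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl", "Mn", "Mc", "Nd", "Pc", "Cf"}


def altered_by_compat(ch: str) -> bool:
    """True if NFKC changes the folded character beyond what NFC would do."""
    folded = ch.casefold()
    nfkc = unicodedata.normalize("NFKC", folded)
    nfc = unicodedata.normalize("NFC", folded)
    return nfkc != nfc


def candidates(limit: int = 0x10000):
    """Yield admitted characters below `limit` that compatibility mapping alters."""
    for cp in range(limit):
        ch = chr(cp)
        if unicodedata.category(ch) in ALLOWED and altered_by_compat(ch):
            yield ch
```

Anything this yields is a character that passes an unmodified (b) yet does not reach (f) intact; the question is whether any of them is actually needed.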

Mark