
[Fwd: Need for Normalization forms "KR" was: Re: [idn] case folding]



Perhaps you missed my previous email, included below. The question I posed was: which of the characters that KC would decompose actually need to be in DNS names?

A. Of the characters you list that might be a problem for KC, I don't see that any of them must be in DNS names. As remarked previously, if you don't allow fractions in DNS names in the first place, it doesn't matter that you don't like what KC does with them.

> (a) Fraction substitution:
>
> As Andrew noted these can lead to term boundary issues
>
> (b) Bullet substitutions
>
> As I noted in my Normalization Form KR proposal these can also lead to
> term boundary issues, esp. for circled bullet characters
>
> (c) Spacing accents substitution
>
> Spacing accents are mapped by form KC to SPACE + non-spacing accent.
> This inappropriately introduces a space character into the term, as well as
> introducing non-spacing marks where none were in the data before
>
> (d) Math folding
>
> Form KC provides an aggressive folding of letterlike mathematical symbols to
> their nearest ASCII or Hebrew equivalent. Even more so than Kana folding, this
> is restricted in its usefulness.
>
> (e) various "cluster" decompositions
>
> Unicode contains many clusters, e.g. square symbols, some of the letterlike
> characters that are made up of several characters. 'Decomposing' these may or
> may not be the right thing for search equivalence. Parenthesized characters
> and numbers would probably be immune to the term boundaries issues raised
> earlier, but the story is less clear for others.
>
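
For reference, the decompositions discussed in (a) through (e) can be inspected directly. The following is a minimal sketch, assuming Python's standard unicodedata module (which tracks a newer version of the Unicode Character Database than the one current for this thread); the helper name nfkc_codepoints and the choice of sample characters are illustrative only.

```python
import unicodedata


def nfkc_codepoints(ch: str) -> list[str]:
    """Return the NFKC form of `ch` as a list of U+XXXX strings."""
    return [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFKC", ch)]


# Sample characters picked to match the categories in the list above
# (hypothetical selection, not taken from the original message).
samples = {
    "vulgar fraction one half": "\u00BD",
    "acute accent (spacing)": "\u00B4",
    "alef symbol": "\u2135",
    "circled digit one": "\u2460",
    "parenthesized digit one": "\u2474",
    "Euler constant": "\u2107",
}

for name, ch in samples.items():
    print(f"{name}: U+{ord(ch):04X} -> {' '.join(nfkc_codepoints(ch))}")
```

The spacing acute accent comes out as U+0020 U+0301, which is the SPACE + non-spacing accent case described in (c), and the alef symbol comes out as the Hebrew letter U+05D0, as described in (d).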

B. Of the additional foldings that one might make, you list:

> Case folding (not provided by KC)
> Kana folding (not provided by KC)
> Hyphen/Dash folding (not provided by KC)
> Han duplicates folding (partially provided by KC) (**)
> Native digit folding (not provided by KC)
>

Rather than rushing off to invent new forms, it would be productive to see which of these are important for the subject at hand. That is, which should be provided by a folding of DNS names.

- Case is already on the table, in Paul's formulation.
- I am not sure that Kana should be: Japanese speakers should weigh in. If kana *is* mapped, should it include length marks?
- Hyphen-Dash folding. KC already performs the foldings shown in the two charts below. Which others were you thinking of?
http://www.unicode.org/unicode/reports/tr15/charts/NormalizationChart7.html
http://www.unicode.org/unicode/reports/tr15/charts/NormalizationChart10.html
- Han duplicates. What are you thinking of here? Simplified vs. Traditional is not algorithmic -- are you thinking of radicals, or some other mapping? Do you have data for whatever mapping you are thinking of?
- Native digit folding. One could do this. I don't know whether it is necessary or not. If so, it could use the data in the UCD, based on the decimal digit value in field 6; see ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.html#Field Formats
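
As an illustration of the last item, native digit folding driven by the UCD decimal digit value might look like the sketch below. It assumes Python's unicodedata.decimal() as the source of the field 6 value; the function name fold_digits is hypothetical, and whether such a folding is wanted for DNS names at all is exactly the open question.

```python
import unicodedata


def fold_digits(text: str) -> str:
    """Replace every character that has a decimal digit value with the
    corresponding ASCII digit; leave all other characters untouched."""
    out = []
    for ch in text:
        # decimal() returns the second argument when the character has
        # no decimal digit value in the UCD.
        value = unicodedata.decimal(ch, None)
        out.append(str(value) if value is not None else ch)
    return "".join(out)


# ARABIC-INDIC DIGIT THREE and FOUR fold to "34".
print(fold_digits("\u0663\u0664"))
```

Characters that carry only a digit or numeric value but no decimal digit value, such as superscript digits, are left alone by this sketch.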

Mark

--------------------
My previous message:

Mark Davis wrote:

> I agree with Ken that the current list of compatibility characters was not necessarily designed with identifiers or DNS names in mind, and that there are some definite oddities. But just as some people are against introducing new UTF forms (unless the current list is shown to be clearly inadequate), I am strongly against adding new normalization formats (unless the current list is shown to be clearly inadequate).
>
> In particular, Paul's formulation seems exactly right:
>
> a. Input from client software ->
> b. Check for prohibited #1 ->
> c. Case fold ->
> d. Canonicalize ->
> e. Check for prohibited #2 ->
> f. Put on wire
>
> The key then becomes: are there characters that are not prohibited by (b) or (e) that really cause problems.
>
> Assume for now that (b) starts from the standard identifier syntax:
>
>           <identifier> ::= <identifier_start> ( <identifier_start> | <identifier_extend> )*
>           <identifier_start> ::= [{Lu}{Ll}{Lt}{Lm}{Lo}{Nl}]
>           <identifier_extend> ::= [{Mn}{Mc}{Nd}{Pc}{Cf}]
>
> Asmus's problem list included:
>
>      <font>
>      <super>
>      <sub>
>      <circle>
>      <compat>
>
>      The <compat> sub-type, being the 'grab-bag' of characters
>      with compatibility relations that are not further
>      specified, and in some cases even questionable (2107) would need to be
>      analyzed once, in case-by-case approach. Some examples:
>
>      Roman Numerals: KR
>      Parenthesized: KR
>      CJK and Radicals compats: KR
>
>      Dotted Alphanumerics: probably KR
>      Ligatures: probably KR
>      Telegraph symbols: probably KR
>
>      Euler Constant: not-KR
>      Alef Symbol, etc.: not-KR
>
>      Spacing accents (mapped to SP + combining accents): ??
>
> Many of these are already eliminated by (b). You don't care that the fraction "1/2" decomposes in KC because you don't allow it in the first place; the same with many other characters. The spacing accents are also eliminated in (b).
>
> Of the remaining ones (e.g. Euler Constant), I strongly suspect that the world can live without them being in DNS names. Thus they can also be filtered in step (b), by restricting the syntax a bit more.
>
> So, I will rephrase my question ("I'd like to see examples of some cases that the contras believe are problems"), but more carefully this time (perhaps even this time not carefully enough for Ken, but we will see).
>
> Given
>     1. Paul's formulation of the process (with my lettering above)
>     2. NFKC used for (d)
>     3. UTR #21 case folding used for (c)
>     4. Remaining Asmus cases (after further review) filtered by (b)
>
> Are there characters that would not survive into (f), that MUST be in DNS names?
>
> Mark
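
To make the question in the earlier message concrete, steps (b) through (e) can be sketched in a few lines. This is only a rough sketch under stated assumptions: Python's str.casefold() stands in for UTR #21 case folding, the check for prohibited #2 is assumed to simply repeat the identifier check from (b), and the names is_identifier and prepare_label are hypothetical.

```python
import unicodedata

# General categories from the <identifier_start> and <identifier_extend>
# productions quoted in the message.
START = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
EXTEND = {"Mn", "Mc", "Nd", "Pc", "Cf"}


def is_identifier(label: str) -> bool:
    """Check a label against the standard identifier syntax."""
    if not label or unicodedata.category(label[0]) not in START:
        return False
    return all(unicodedata.category(c) in START | EXTEND for c in label[1:])


def prepare_label(label: str) -> str:
    """Run steps (b) to (e); the return value is what would go on the wire."""
    if not is_identifier(label):                    # (b) prohibited #1
        raise ValueError(f"prohibited before folding: {label!r}")
    folded = label.casefold()                       # (c) case fold (assumed stand-in)
    canonical = unicodedata.normalize("NFKC", folded)  # (d) canonicalize
    if not is_identifier(canonical):                # (e) prohibited #2 (assumed check)
        raise ValueError(f"prohibited after folding: {label!r}")
    return canonical


print(prepare_label("Example"))
```

Under this sketch the fraction U+00BD (category No) and the spacing acute accent U+00B4 (category Sk) are rejected at (b), so their KC decompositions never come into play, while the Euler constant U+2107 (category Lu) passes (b) and would need the tighter syntax mentioned in the message to be filtered out.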