
Re: [idn] case folding



I have been disconnected from the net -- except for intermittent access -- for 4 weeks, so I had a bit of catching up to do. I'll try to hit a few points that people mentioned on this subject.

1. UTR#21 status. This was advanced to the "approved" stage at a recent UTC meeting. It is listed that way in the header and on the index page:
http://www.unicode.org/unicode/reports/tr21/
http://www.unicode.org/unicode/reports/

For those people who don't know about the UTC approval process, I'll summarize it here:

- A UTR normally advances from "proposed draft" to "draft" to "approved". It can sometimes then advance to "part of the standard".
- At each step, it must be passed by a majority vote of the full members present at a UTC meeting, or by a majority of all full members, if a letter ballot. Full members are listed at http://www.unicode.org/unicode/consortium/memblogo.html.
- After such passage, the final text must be approved by the UTC editorial subcommittee (which has about 8 members from the UTC).
- The UTRs can be updated with simple editorial changes or corrigenda with simple approval of the editorial committee.
- Any substantial change in an update requires approval again by the UTC.
- All versions are stable: e.g. once posted, the contents of the file for UTR #21 version 3 will not be changed, and will always be found at the "versioned link":
  http://www.unicode.org/unicode/reports/tr21/tr21-3.
The latest version will be found at the "unversioned" link.
  http://www.unicode.org/unicode/reports/tr21/
Each file will link back through the previous versions, so you can recover the entire history by starting at the latest "unversioned" link. For more on versioning, see http://www.unicode.org/unicode/standard/versions/

2. UTR#21 confusion. If people have any feedback on the parts of UTR#21 that they find confusing, I'd like to know.

3. Visualization. If you have a Unicode-enabled browser and the appropriate fonts (which I hope you all do!), you can see the case mappings at http://www.unicode.org/unicode/reports/tr21/charts/

Similarly, you can see the normalization forms at http://www.unicode.org/unicode/reports/tr15/charts/

4. Normalization & Casing. Because some case operations produce separate accents, Kent is right that normalization (NFC or NFD) must be performed after case operations such as case folding.
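
As a rough illustration of this ordering, here is a minimal Python sketch. It assumes that Python's built-in str.casefold() and unicodedata.normalize() (which follow whatever Unicode version the interpreter ships with, not necessarily the one UTR#21 was written against) are acceptable stand-ins for the case folding and normalization discussed here. The function name fold_then_normalize is made up for the example.

```python
import unicodedata


def fold_then_normalize(label: str, form: str = "NFC") -> str:
    """Case fold first, then normalize, so that any accents split off
    by the folding step are recomposed."""
    return unicodedata.normalize(form, label.casefold())


# U+01F0 (j with caron) case folds to "j" + U+030C, which is not in NFC.
folded_only = "\u01f0".casefold()

# Normalizing afterwards recomposes it to the single code point U+01F0.
folded_and_normalized = fold_then_normalize("\u01f0")

# Kent's example from the quoted message: sharp s + combining acute
# folds to "ss" + acute, and NFC then composes the second s with the acute.
kent_example = fold_then_normalize("\u00df\u0301")
```

Normalizing only before the folding step would leave folded_only in its decomposed state, which is why the normalization pass has to come last.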

5. Normalization Form KC. It is an open issue whether one wants to use NFKC instead of NFC. Using NFKC will eliminate many distinctions that are essentially just a matter of formatting -- such as changing the "fi" ligature to an "f" and "i" sequence, changing half-width characters to the corresponding full-width variants, etc. However, also included in this list are super- and sub-scripted digits and letter-like symbols (see http://www.unicode.org/unicode/reports/tr15/charts/NormalizationChart5.html). There is some discussion of this in UTR #20, "Unicode in XML and other Markup Languages" (http://www.unicode.org/unicode/reports/tr20/#Compatibility), although this UTR must take a wider view, since it discusses general data and not just identifiers or DNS names.

My view is that NFKC is generally appropriate for cases where identifiers are case-insensitive, but otherwise reasonable people may disagree with me ;-)
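
To make the NFC/NFKC difference concrete, here is a small Python sketch using the standard unicodedata module. The sample characters are my own picks for illustration, and the results reflect the Unicode data bundled with the Python interpreter.

```python
import unicodedata

# Sample characters chosen for illustration only.
samples = {
    "fi ligature": "\ufb01",          # LATIN SMALL LIGATURE FI
    "halfwidth katakana": "\uff76",   # HALFWIDTH KATAKANA LETTER KA
    "superscript two": "\u00b2",      # SUPERSCRIPT TWO
    "fullwidth A": "\uff21",          # FULLWIDTH LATIN CAPITAL LETTER A
}

# NFC keeps compatibility characters as they are; NFKC replaces them
# with their compatibility equivalents.
nfc = {name: unicodedata.normalize("NFC", ch) for name, ch in samples.items()}
nfkc = {name: unicodedata.normalize("NFKC", ch) for name, ch in samples.items()}

for name, ch in samples.items():
    print(f"{name}: NFC={nfc[name]!r} NFKC={nfkc[name]!r}")
```

The superscript digit is the kind of case where the loss of distinction may or may not be wanted, depending on what the identifiers are used for.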

6. Excluding characters (controls, Hebrew cantillation marks, etc). Unicode does supply recommendations for identifier syntax, which is closely related to this subject. These are in The Unicode Standard Version 3.0, and are also summarized in "http://www.unicode.org/unicode/reports/tr15/#Programming Language Identifiers". I'll copy part of it here:

          <identifier> ::= <identifier_start> ( <identifier_start> | <identifier_extend> )*

          <identifier_start> ::= [{Lu}{Ll}{Lt}{Lm}{Lo}{Nl}]

          <identifier_extend> ::= [{Mn}{Mc}{Nd}{Pc}{Cf}]

  That is, the first character of an identifier can be an uppercase letter, lowercase letter, titlecase letter, modifier letter, other
  letter, or letter number. The subsequent characters of an identifier can be any of those, plus non-spacing marks, spacing
  combining marks, decimal numbers, connector punctuations, and formatting codes (such as right-left-mark). Normally the
  formatting codes should be filtered out before storing or comparing identifiers.
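
The grammar above can be sketched in Python roughly as follows. This is only an illustration: the function names are invented for the example, and the General Category values come from Python's unicodedata module, whose Unicode version will differ from 3.0.

```python
import unicodedata

# General Categories allowed by the grammar quoted above.
IDENTIFIER_START = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
IDENTIFIER_EXTEND = {"Mn", "Mc", "Nd", "Pc", "Cf"}


def is_identifier(text: str) -> bool:
    """Return True if text matches
    <identifier_start> ( <identifier_start> | <identifier_extend> )*"""
    if not text:
        return False
    if unicodedata.category(text[0]) not in IDENTIFIER_START:
        return False
    allowed = IDENTIFIER_START | IDENTIFIER_EXTEND
    return all(unicodedata.category(ch) in allowed for ch in text[1:])


def strip_format_codes(text: str) -> str:
    """Drop formatting codes (General Category Cf) before storing or
    comparing identifiers."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")
```

For example, is_identifier("a_b") holds because "_" is connector punctuation (Pc), while is_identifier("a-b") does not because "-" is dash punctuation (Pd). Note that this grammar by itself would reject the hyphen that DNS labels already allow, so it would need adjusting for that use.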

Since the normalization and casing charts are organized by General Category (and for letters, by script), you can look at them to see the implications of normalization and casing on identifiers.

7. There is a new search facility on the Unicode site, at http://www.unicode.org/search/ that may be helpful. I'd recommend using the literal search for longer sequences of words, but fuzzy search for single words like "identifier".

Mark

Karlsson Kent - keka wrote:

>
>
> ...
>
> > But the problem is that since the beginning they should have
> > considered d-o^ng and D-O^NG different labels!
> > We cannot do this for ASCII roman letters [A-Za-z], since we must
> > retain backward compatibility, but nobody can stop us from saying
> > "if in a label there is a character outside [0-9A-Za-z-], then
> > that label is case sensitive".
>
> The idea of letting a-z be case insensitive, but "everything else"
> be case sensitive is a very strange idea.  When would the case mapping
> of A-Z (to lowercase) take place?  If it's done before normalisation
> (to normal form C or KC according to UTR 15), then any and arbitrary
> characters that are diacritised A-Z may or may not be mapped to lowercase.
> That is because there may be letters there in decomposed form, e.g.,
> A+¨ would be mapped to a+¨ and then normalised to ä, whereas Ä would
> not be touched.
>
> Even saying that case mapping must be done after normalisation
> can produce strange effects.  Say that some orthography required the
> use of p with diaeresis.  There is no such precomposed letter. So
> even if an IDN with a P with diaeresis is mapped to lowercase
> after normalisation, then Ä would not be touched but P+¨ would be
> mapped to p+¨.  This kind of inconsistency I find unbearable.
>
> Unfortunately it seems we have to carry on with case insensitivity,
> for backwards compatibility reasons.  This means that case
> insensitivity has to be generalised to more than a-z, in some way
> that is an international compromise (since case mapping is in some
> parts dependent on orthography).  The only other option is to make
> IDNs case sensitive also for a-z.  An option I find preferable,
> since it greatly simplifies things, but I guess it's not acceptable
> for other reasons.
>
> A problem with current UTR 21 is the following:
>
> Say that we use normalisation form KC (which is reasonable, given that
> case insensitivity is desired, it would be strange to maintain all
> of the compatibility distinctions).
>
> If one first does normalisation to KC, and then "case fold", the
> result might not be in any normal form (D, C, KD, KC) at all. E.g. ß+´
> (which is in NF KC), case folds to ss+´ which is not in NF KC since
> there is a precomposed s with acute.
>
> If one instead first does "case folding", and then normalise to form
> KC, the result may contain uppercase letters that have a "case
> folding" to a lowercase letter.  E.g. BLACK-LETTER CAPITAL H has
> no mapping to lowercase. Then normalise to NF KC, and you will be
> left with a capital H (whereas other H-es will be mapped to h).
>
> I think that first doing case folding together with compatibility
> decomposition, and then do canonical composition (see UTR 15), would
> produce a reasonable and stable result that both ignores compatibility
> distinctions as well as case distinctions.  But UTR 21 does not (yet)
> say that that's how to do it.
>
>                 Kind regards
>                 /kent k
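
For reference, the ordering Kent suggests at the end of his message (case folding together with compatibility decomposition, followed by canonical composition) could be sketched in Python as below. This is an approximation under my own assumptions, not something UTR 21 specifies: fold_compat is a hypothetical name, str.casefold() stands in for the UTR 21 folding, and a single pass may not reach a fixed point for every character.

```python
import unicodedata


def fold_compat(label: str) -> str:
    """Compatibility-decompose, case fold, then recompose canonically."""
    decomposed = unicodedata.normalize("NFKD", label)
    folded = decomposed.casefold()
    # NFKC decomposes again (covering anything the folding introduced)
    # and then applies canonical composition.
    return unicodedata.normalize("NFKC", folded)


# BLACK-LETTER CAPITAL H (U+210C) has no case folding of its own, but
# its compatibility decomposition is "H", which then folds to "h".
black_letter_h = fold_compat("\u210c")

# Sharp s + combining acute: "ss" + acute, with the second s recomposed.
sharp_s_acute = fold_compat("\u00df\u0301")
```

On these two examples the result is both case folded and in NFKC, which avoids the two problems described in the quoted message.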