RE: [idn] case folding


> But the problem is that since the beginning they should have
> considered d-o^ng and D-O^NG different labels!
> We cannot do this for ASCII roman letters [A-Za-z], since we must
> retain backward compatibility, but nobody can stop us from saying
> "if in a label there is a character outside [0-9A-Za-z-], then
> that label is case sensitive".

The idea of letting a-z be case insensitive, but "everything else"
be case sensitive is a very strange idea.  When would the case mapping
of A-Z (to lowercase) take place?  If it's done before normalisation
(to normal form C or KC according to UTR 15), then a diacritised A-Z
letter may or may not be mapped to lowercase, depending on how it
happens to be encoded.  That is because there may be letters there in
decomposed form, e.g., A+¨ would be mapped to a+¨ and then normalised
to ä, whereas Ä would not be touched.
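
A minimal sketch of that first case, using Python's unicodedata module
as a stand-in for UTR 15 normalisation.  The helper lower_ascii_only is
hypothetical and only illustrates the "A-Z only" rule from the quoted
proposal; it is not taken from any draft.

```python
import unicodedata

def lower_ascii_only(text):
    """Lowercase A-Z only; every other character is left untouched."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in text)

decomposed = "A\u0308"   # A followed by COMBINING DIAERESIS
precomposed = "\u00C4"   # precomposed LATIN CAPITAL LETTER A WITH DIAERESIS

# Case-map A-Z first, then normalise to form C.
from_decomposed = unicodedata.normalize("NFC", lower_ascii_only(decomposed))
from_precomposed = unicodedata.normalize("NFC", lower_ascii_only(precomposed))

print(from_decomposed == "\u00E4")   # the decomposed spelling became lowercase
print(from_precomposed == "\u00C4")  # the precomposed spelling stayed uppercase
```

The two inputs are canonically equivalent, yet they end up as different
labels.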

Even requiring that case mapping be done after normalisation can
produce strange effects.  Say that some orthography required the
use of p with diaeresis.  There is no such precomposed letter.  So
even if an IDN containing a P with diaeresis is mapped to lowercase
after normalisation, Ä would not be touched but P+¨ would be
mapped to p+¨.  This kind of inconsistency I find unbearable.
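
The same kind of sketch for the other order, normalising first and then
lowercasing only A-Z.  Again lower_ascii_only and normalise_then_lower
are hypothetical helpers, and Python's unicodedata is assumed as the
normaliser.

```python
import unicodedata

def lower_ascii_only(text):
    """Lowercase A-Z only; every other character is left untouched."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in text)

def normalise_then_lower(text):
    """Normalise to form C first, then apply the A-Z-only case mapping."""
    return lower_ascii_only(unicodedata.normalize("NFC", text))

a_result = normalise_then_lower("A\u0308")  # composes to precomposed Ä first
p_result = normalise_then_lower("P\u0308")  # no precomposed form, stays P + diaeresis

print(a_result == "\u00C4")    # Ä is outside A-Z, so it is not lowercased
print(p_result == "p\u0308")   # the bare P is inside A-Z, so it is lowercased
```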

Unfortunately it seems we have to carry on with case insensitivity,
for backwards compatibility reasons.  This means that case
insensitivity has to be generalised to more than a-z, in some way
that is an international compromise (since case mapping is in some
parts dependent on orthography).  The only other option is to make
IDNs case sensitive also for a-z, an option I find preferable
since it greatly simplifies things, but I guess it's not acceptable
for other reasons.

A problem with current UTR 21 is the following:

Say that we use normalisation form KC (which is reasonable: given that
case insensitivity is desired, it would be strange to maintain all
of the compatibility distinctions).

If one first does normalisation to KC, and then "case fold", the
result might not be in any normal form (D, C, KD, KC) at all. E.g. ß+´
(which is in NF KC), case folds to ss+´ which is not in NF KC since
there is a precomposed s with acute.
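
This can be checked with a short sketch.  It assumes Python's
str.casefold() as an approximation of the UTR 21 case folding and
unicodedata.normalize for NF KC; neither is claimed to match the drafts
exactly.

```python
import unicodedata

label = "\u00DF\u0301"  # ß followed by COMBINING ACUTE ACCENT

# Step 1: normalise to NF KC.  Nothing changes, the input is already in NF KC.
nfkc_first = unicodedata.normalize("NFKC", label)

# Step 2: case fold.  ß folds to "ss", leaving s + s + combining acute.
folded = nfkc_first.casefold()

# Normalising the folded string again composes the last s with the acute,
# which shows that the folded string was not in NF KC.
renormalised = unicodedata.normalize("NFKC", folded)

print(nfkc_first == label)      # input was already normalised
print(folded == renormalised)   # False when folding broke the normal form
```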

If one instead first does "case folding", and then normalises to form
KC, the result may contain uppercase letters that do have a "case
folding" to a lowercase letter.  E.g. BLACK-LETTER CAPITAL H has
no mapping to lowercase, so case folding leaves it alone.  Then
normalise to NF KC, and you will be left with a capital H (whereas
other H-es will be mapped to h).
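
A sketch of this order, again assuming Python's str.casefold() and
unicodedata.normalize as stand-ins for the UTR 21 and UTR 15
operations.  The function name fold_then_nfkc is hypothetical.

```python
import unicodedata

def fold_then_nfkc(text):
    """Case fold first, then normalise to NF KC."""
    return unicodedata.normalize("NFKC", text.casefold())

black_letter_h = "\u210C"  # BLACK-LETTER CAPITAL H, no case folding of its own

from_black_letter = fold_then_nfkc(black_letter_h)
from_plain = fold_then_nfkc("H")

print(from_black_letter == "H")  # compatibility mapping exposes an unfolded capital
print(from_plain == "h")         # an ordinary H is folded as expected
```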

I think that first doing case folding together with compatibility
decomposition, and then doing canonical composition (see UTR 15), would
produce a reasonable and stable result that ignores both compatibility
distinctions and case distinctions.  But UTR 21 does not (yet)
say that that's how to do it.
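
One possible reading of that order, as a rough sketch only.  It assumes
Python's str.casefold() and unicodedata.normalize, and it decomposes a
second time after folding because folding can itself produce
decomposable sequences; that extra pass is my assumption, not something
UTR 21 specifies.  The name fold_kd_then_compose is hypothetical.

```python
import unicodedata

def fold_kd_then_compose(text):
    """Compatibility-decompose and case fold, then canonically compose."""
    decomposed = unicodedata.normalize("NFKD", text)
    folded = decomposed.casefold()
    # Decompose again in case folding introduced anything decomposable.
    redecomposed = unicodedata.normalize("NFKD", folded)
    # NFC on an already decomposed string performs the canonical composition.
    return unicodedata.normalize("NFC", redecomposed)

sharp_s = fold_kd_then_compose("\u00DF\u0301")   # ß + combining acute
fraktur_h = fold_kd_then_compose("\u210C")       # BLACK-LETTER CAPITAL H
plain_h = fold_kd_then_compose("H")

print(sharp_s == "s\u015B")                       # composed s with acute at the end
print(fraktur_h == plain_h == "h")                # all H-es end up as h
print(fold_kd_then_compose(sharp_s) == sharp_s)   # applying it again changes nothing
```

On the two problem cases above, this order gives a result that is both
folded and in NF KC.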

                Kind regards
                /kent k