
Re: [idn] case folding



I believe this is an important issue. Maybe someone from Unicore can share
some insight?

ps: Yes, this is what we meant by "sharing the blame" :-)

-James Seng

> Karlsson Kent - keka wrote:
> 
> ...
> 
> > But the problem is that since the beginning they should have
> > considered d-o^ng and D-O^NG different labels!
> > We cannot do this for ASCII roman letters [A-Za-z], since we must
> > retain backward compatibility, but nobody can stop us from saying
> > "if in a label there is a character outside [0-9A-Za-z-], then
> > that label is case sensitive".
> 
> The idea of letting a-z be case insensitive, but "everything else"
> be case sensitive is a very strange idea.  When would the case mapping
> of A-Z (to lowercase) take place?  If it's done before normalisation
> (to normal form C or KC according to UTR 15), then any and arbitrary
> characters that are diacritised A-Z may or may not be mapped to lowercase.
> That is because there may be letters there in decomposed form, e.g.,
> A+¨ would be mapped to a+¨ and then normalised to ä, whereas Ä would
> not be touched.
> 
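
A minimal Python sketch of the ordering problem described above, assuming
the hypothetical helper names ascii_lower and map_then_normalise (neither
comes from any draft), and using form C from Python's unicodedata module:

```python
import unicodedata

def ascii_lower(s):
    # Hypothetical helper: lowercase only A-Z, leave everything else alone.
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s)

def map_then_normalise(label):
    # Case-map A-Z first, then normalise to form C.
    return unicodedata.normalize("NFC", ascii_lower(label))

decomposed = "A\u0308"   # A + combining diaeresis
precomposed = "\u00c4"   # precomposed A with diaeresis

# The decomposed spelling ends up lowercase, the precomposed one does not.
print(map_then_normalise(decomposed) == "\u00e4")
print(map_then_normalise(precomposed) == "\u00c4")
```

Two canonically equivalent inputs end up as different labels, which is the
inconsistency being objected to.
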
> Even requiring that case mapping be done after normalisation
> can produce strange effects.  Say that some orthography required the
> use of p with diaeresis.  There is no such precomposed letter. So
> even if an IDN with a P with diaeresis is mapped to lowercase
> after normalisation, then Ä would not be touched but P+¨ would be
> mapped to p+¨.  This kind of inconsistency I find unbearable.
> 
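
The reverse order can be sketched the same way. This assumes the
hypothetical names ascii_lower and normalise_then_map, and relies on the
fact that Unicode has no precomposed P with diaeresis:

```python
import unicodedata

def ascii_lower(s):
    # Hypothetical helper: lowercase only A-Z, leave everything else alone.
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s)

def normalise_then_map(label):
    # Normalise to form C first, then case-map A-Z only.
    return ascii_lower(unicodedata.normalize("NFC", label))

# A + diaeresis composes to a single non-ASCII letter and stays uppercase.
print(normalise_then_map("A\u0308") == "\u00c4")
# P + diaeresis cannot compose, so the bare P is still lowercased.
print(normalise_then_map("P\u0308") == "p\u0308")
```
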
> Unfortunately it seems we have to carry on with case insensitivity,
> for backwards compatibility reasons.  This means that case
> insensitivity has to be generalised to more than a-z, in some way
> that is an international compromise (since case mapping is in some
> parts dependent on orthography).  The only other option is to make
> IDNs case sensitive also for a-z.  An option I find preferable,
> since it greatly simplifies things, but I guess it's not acceptable
> for other reasons.
> 
> A problem with current UTR 21 is the following:
> 
> Say that we use normalisation form KC (which is reasonable: given that
> case insensitivity is desired, it would be strange to maintain all
> of the compatibility distinctions).
> 
> If one first does normalisation to KC, and then "case fold", the
> result might not be in any normal form (D, C, KD, KC) at all. E.g. ß+´
> (which is in NF KC), case folds to ss+´ which is not in NF KC since
> there is a precomposed s with acute.
> 
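
The ß example can be checked with a short Python sketch. It assumes that
Python's str.casefold() is an acceptable stand-in for the "case fold"
operation under discussion; it follows the Unicode data shipped with the
interpreter, not necessarily the text of UTR 21 at the time of writing.

```python
import unicodedata

label = "\u00df\u0301"   # sharp s + combining acute

# Step 1: normalise to form KC. The input is already in that form.
nfkc = unicodedata.normalize("NFKC", label)

# Step 2: case fold. Sharp s folds to "ss", the acute stays attached.
folded = nfkc.casefold()

# The folded string is no longer in form KC: normalising it again
# composes the second s with the acute into the precomposed letter.
renormalised = unicodedata.normalize("NFKC", folded)
print(folded != renormalised)
```
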
> If one instead first does "case folding", and then normalise to form
> KC, the result may contain uppercase letters that have a "case
> folding" to a lowercase letter.  E.g. BLACK-LETTER CAPITAL H has
> no mapping to lowercase. Then normalise to NF KC, and you will be
> left with a capital H (whereas other H-es will be mapped to h).
> 
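
The opposite order can be sketched in the same way, again assuming
str.casefold() as a stand-in for the case folding under discussion:

```python
import unicodedata

black_letter_h = "\u210c"   # BLACK-LETTER CAPITAL H

# Step 1: case fold. This character has no case folding, so it is unchanged.
folded = black_letter_h.casefold()

# Step 2: normalise to form KC. The compatibility mapping yields a plain
# capital H, which is never folded because folding has already happened.
result = unicodedata.normalize("NFKC", folded)
print(result)
```

An ordinary "H" put through the same two steps comes out as "h", so the two
spellings do not match.
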
> I think that first doing case folding together with compatibility
> decomposition, and then doing canonical composition (see UTR 15), would
> produce a reasonable and stable result that both ignores compatibility
> distinctions as well as case distinctions.  But UTR 21 does not (yet)
> say that that's how to do it.
> 
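
One possible reading of this proposal, as a rough Python sketch. The
function name fold_label is hypothetical, str.casefold() is assumed to
stand in for the case folding step, and the second decomposition pass is an
assumption added here to cover folding results that themselves decompose.

```python
import unicodedata

def fold_label(label):
    # Compatibility decomposition first, so compatibility variants become
    # ordinary letters before folding sees them.
    decomposed = unicodedata.normalize("NFKD", label)
    folded = decomposed.casefold()
    # Assumption: decompose once more in case folding introduced
    # characters that have their own decompositions.
    redecomposed = unicodedata.normalize("NFKD", folded)
    # Canonical composition last, so the result is in a normal form.
    return unicodedata.normalize("NFC", redecomposed)

# Both problem cases from the message now give a stable, lowercase result.
print(fold_label("\u210c") == "h")
print(fold_label("\u00df\u0301") == "s\u015b")
print(fold_label("\u00c4") == fold_label("a\u0308"))
```
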
>                 Kind regards
>                 /kent k