
Re: [idn] case folding



I agree with Ken that using KC is going too far.

Regards,   Martin.

At 00/06/14 01:07 -0800, James Seng wrote:
>I believe this is an important issue. Maybe someone from Unicore can share
>some insight?
>
>ps: Yes, this is what we meant by "sharing the blame" :-)
>
>-James Seng
>
> > Karlsson Kent - keka wrote:
> >
> > ...
> >
> > > But the problem is that since the beginning they should have
> > > considered d-o^ng and D-O^NG different labels!
> > > We cannot do this for ASCII roman letters [A-Za-z], since we must
> > > retain backward compatibility, but nobody can stop us from saying
> > > "if in a label there is a character outside [0-9A-Za-z-], then
> > > that label is case sensitive".
> >
> > The idea of letting a-z be case insensitive, but "everything else"
> > be case sensitive is a very strange idea.  When would the case mapping
> > of A-Z (to lowercase) take place?  If it's done before normalisation
> > (to normal form C or KC according to UTR 15), then any and arbitrary
> > characters that are diacritised A-Z may or may not be mapped to lowercase.
> > That is because there may be letters there in decomposed form, e.g.,
> > A+¨ would be mapped to a+¨ and then normalised to ä, whereas Ä would
> > not be touched.
> >
> > Even requiring that case mapping be done after normalisation
> > can produce strange effects.  Say that some orthography required the
> > use of p with diaeresis.  There is no such precomposed letter. So
> > even if an IDN with a P with diaeresis is mapped to lowercase
> > after normalisation, then Ä would not be touched but P+¨ would be
> > mapped to p+¨.  This kind of inconsistency I find unbearable.
> >
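The two orderings described above can be tried out with a small sketch.
It assumes that "decomposed form" means a base letter followed by
COMBINING DIAERESIS (U+0308), and it uses Python's unicodedata module
with normal form C.  The helper names ascii_lower and nfc are
illustrative only; they do not come from any draft.

```python
import unicodedata

DIAERESIS = "\u0308"  # COMBINING DIAERESIS (assumed reading of the example)


def ascii_lower(label: str) -> str:
    """Lowercase only A-Z; every other character is left untouched."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in label)


def nfc(label: str) -> str:
    """Normalise to Unicode normal form C."""
    return unicodedata.normalize("NFC", label)


# Case mapping BEFORE normalisation: two canonically equivalent
# spellings of capital A with diaeresis end up in different case.
decomposed_first = nfc(ascii_lower("A" + DIAERESIS))  # a with diaeresis
precomposed_first = nfc(ascii_lower("\u00c4"))        # still capital

# Case mapping AFTER normalisation: A with diaeresis composes and so
# escapes the A-Z mapping, while P with diaeresis has no precomposed
# form and is lowercased.
a_after = ascii_lower(nfc("A" + DIAERESIS))
p_after = ascii_lower(nfc("P" + DIAERESIS))
```

In both orderings, whether a letter is lowercased depends on whether a
precomposed code point happens to exist for it.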
> > Unfortunately it seems we have to carry on with case insensitivity,
> > for backwards compatibility reasons.  This means that case
> > insensitivity has to be generalised to more than a-z, in some way
> > that is an international compromise (since case mapping is in some
> > parts dependent on orthography).  The only other option is to make
> > IDNs case sensitive also for a-z.  An option I find preferable,
> > since it greatly simplifies things, but I guess it's not acceptable
> > for other reasons.
> >
> > A problem with current UTR 21 is the following:
> >
> > Say that we use normalisation form KC (which is reasonable, given that
> > case insensitivity is desired, it would be strange to maintain all
> > of the compatibility distinctions).
> >
> > If one first does normalisation to KC, and then "case fold", the
> > result might not be in any normal form (D, C, KD, KC) at all. E.g. ß+´
> > (which is in NF KC), case folds to ss+´ which is not in NF KC since
> > there is a precomposed s with acute.
> >
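A minimal sketch of this first ordering, assuming the example is
LATIN SMALL LETTER SHARP S followed by COMBINING ACUTE ACCENT (U+0301).
Python's str.casefold stands in for the UTR 21 case folding here; it
follows current Unicode data, which may differ in detail from the draft
under discussion.

```python
import unicodedata


def nfkc(text: str) -> str:
    """Normalise to Unicode normal form KC."""
    return unicodedata.normalize("NFKC", text)


sample = "\u00df\u0301"         # sharp s + combining acute accent

normalised = nfkc(sample)       # unchanged: there is no precomposed form
folded = normalised.casefold()  # sharp s folds to "ss", accent stays

# The folded string is no longer in NF KC, because the second "s" and
# the accent now compose to LATIN SMALL LETTER S WITH ACUTE (U+015B).
renormalised = nfkc(folded)
still_normalised = folded == renormalised
```

So a second normalisation pass would be needed after folding to get
back to a normal form.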
> > If one instead first does "case folding", and then normalises to form
> > KC, the result may contain uppercase letters that have a "case
> > folding" to a lowercase letter.  E.g. BLACK-LETTER CAPITAL H has
> > no mapping to lowercase. Then normalise to NF KC, and you will be
> > left with a capital H (whereas other H-es will be mapped to h).
> >
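The reverse ordering can be sketched the same way.  This again uses
Python's str.casefold as a stand-in for the UTR 21 folding; the
function name fold_then_nfkc is made up for the illustration.

```python
import unicodedata

BLACK_LETTER_CAPITAL_H = "\u210c"


def fold_then_nfkc(text: str) -> str:
    """Case fold first, then normalise to form KC."""
    return unicodedata.normalize("NFKC", text.casefold())


# The black-letter H has no case folding, so it passes through the
# folding step unchanged; NF KC then replaces it with a plain capital H.
from_black_letter = fold_then_nfkc(BLACK_LETTER_CAPITAL_H)

# An ordinary capital H is folded to lowercase as expected.
from_plain = fold_then_nfkc("H")
```

The two results differ only in case, which is exactly the distinction
the folding was supposed to remove.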
> > I think that first doing case folding together with compatibility
> > decomposition, and then doing canonical composition (see UTR 15), would
> > produce a reasonable and stable result that both ignores compatibility
> > distinctions as well as case distinctions.  But UTR 21 does not (yet)
> > say that that's how to do it.
> >
> >                 Kind regards
> >                 /kent k
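
One possible reading of the ordering suggested in the last paragraph,
as a rough sketch only: decompose with NF KD, case fold, decompose
again in case folding introduced anything decomposable, then compose
with NF C.  The function name and the exact sequence of steps are
assumptions made for illustration; neither UTR 15 nor UTR 21 is quoted
here, and str.casefold reflects current Unicode data.

```python
import unicodedata


def fold_compat_case(text: str) -> str:
    """Hypothetical helper: fold case and compatibility distinctions."""
    decomposed = unicodedata.normalize("NFKD", text)
    folded = decomposed.casefold()
    # Folding can yield characters that decompose further, so decompose
    # once more before the final canonical composition.
    redecomposed = unicodedata.normalize("NFKD", folded)
    return unicodedata.normalize("NFC", redecomposed)


samples = [
    "\u00df\u0301",  # sharp s + combining acute accent
    "\u210c",        # BLACK-LETTER CAPITAL H
    "\u00c4",        # precomposed capital A with diaeresis
    "A\u0308",       # capital A + combining diaeresis
]
results = [fold_compat_case(s) for s in samples]
```

For these four samples the output is composed, lowercase, and
unchanged by a second application; that says nothing about the rest of
the repertoire.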