
RE: [idn] case folding




I don't see what good it would be to distinguish between, e.g., "fi" and "<fi>"
(where <fi> is the fi ligature character), or between a and (fullwidth) a, etc.,
especially since we do not plan to distinguish between a and A (for example).
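
To make the inconsistency concrete, here is a minimal sketch in Python (my own
illustration, not part of any IDN proposal), using the standard unicodedata
module and str.casefold(): NF C keeps the compatibility variants distinct,
NF KC folds them together, and case folding alone only equates a and A while
leaving the compatibility variants untouched.

    import unicodedata

    pairs = [
        ("fi", "\uFB01"),   # "fi" vs the fi ligature <fi> (U+FB01)
        ("a",  "\uFF41"),   # "a" vs fullwidth a (U+FF41)
        ("a",  "A"),        # lowercase vs uppercase
    ]

    for plain, variant in pairs:
        same_nfc  = unicodedata.normalize("NFC",  plain) == unicodedata.normalize("NFC",  variant)
        same_nfkc = unicodedata.normalize("NFKC", plain) == unicodedata.normalize("NFKC", variant)
        same_fold = plain.casefold() == variant.casefold()
        print(repr(variant), "NFC:", same_nfc, "NFKC:", same_nfkc, "casefold:", same_fold)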

Mostly the problem is with characters with code points in the range U+F900 to
U+FFFF (though occasionally others as well).  If using KC is going too far, it is
also clear that normal form C is insufficient, not only because it (like all of
the defined normal forms) is case preserving, but also because undesirable
compatibility distinctions are maintained by NF C (and NF D).  Some of the
compatibility distinctions may be desirable to maintain, but most are certainly
undesirable to maintain as distinctions for identifier comparisons, even case
sensitive identifier name comparisons, and much more so for case insensitive
ones.

One could disallow certain compatibility (letter) characters from being part of
an IDN, but that would have the undesirable effect of producing errors (of some
kind, like "not found") for IDNs that look quite ok to the user, being nearly or
fully indistinguishable, as displayed, from the "proper" name.
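
Just to illustrate what that option would amount to (again a sketch of my own,
not anything that has been specified anywhere): a registry or resolver could
reject any label containing a character with a compatibility decomposition, and
the user would then get an error for a name that displays just like an
acceptable one.

    import unicodedata

    def contains_compatibility_char(label):
        # A character with a compatibility decomposition is reported by
        # unicodedata.decomposition() with a "<tag>" prefix, e.g.
        # "<compat> 0066 0069" for the fi ligature.
        return any(unicodedata.decomposition(ch).startswith("<") for ch in label)

    print(contains_compatibility_char("file"))        # False: plain ASCII, accepted
    print(contains_compatibility_char("\uFB01le"))    # True: "<fi>le" would be rejected,
                                                      # though it displays much like "file"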

Another possibility is for the "case folding" data file to *also* provide
"folding" of the compatibility distinctions that are undesirable for identifier
name identity, mapping them to a common "folded" character (or character
sequence).  Then case folding followed by normalisation to NF C would be quite
ok, in my view.
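
As a rough approximation of such a combined folding (my own sketch in Python,
not the proposed data file), one can interleave full case folding and
compatibility normalisation until the result is stable; the iteration is needed
because, as the quoted message below points out, a single pass in either order
can leave the result either unnormalised or not fully folded.

    import unicodedata

    def identifier_fold(label):
        """Fold case and compatibility distinctions, iterating to a fixed
        point; the final result is canonically composed (NF KC output is
        already in NF C)."""
        prev = None
        while label != prev:
            prev = label
            label = unicodedata.normalize("NFKC", label.casefold())
        return label

    # BLACK-LETTER CAPITAL H has no case folding, but NF KC turns it into H,
    # which then folds to h on the next pass.
    print(identifier_fold("\u210C"))        # h
    # ss + combining acute (from case folding a sharp s + acute) recomposes
    # under NF KC to s followed by s-with-acute (U+015B).
    print(identifier_fold("\u00DF\u0301"))  # s + U+015B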

                Kind regards
                /kent k


> -----Original Message-----
> From: Martin J. Duerst [mailto:duerst@w3.org]
> Sent: Thursday, June 15, 2000 12:41 PM
> To: unicore@unicode.org; Multiple Recipients of Unicore
> Cc: idn@ops.ietf.org; unicore@unicode.org
> Subject: Re: [idn] case folding
>
>
> I agree with Ken that using KC is going too far.
>
> Regards,   Martin.
>
> At 00/06/14 01:07 -0800, James Seng wrote:
> >I believe this is an important issue. Maybe someone from Unicore can
> >share some insight?
> >
> >ps: Yes, this is what we meant by "sharing the blame" :-)
> >
> >-James Seng
> >
> > > Karlsson Kent - keka wrote:
> > >
> > > ...
> > >
> > > > But the problem is that since the beginning they should have
> > > > considered d-o^ng and D-O^NG different labels!
> > > > We cannot do this for ASCII roman letters [A-Za-z], since we
> > > > must retain backward compatibility, but nobody can stop us from
> > > > saying "if in a label there is a character outside [0-9A-Za-z-],
> > > > then that label is case sensitive".
> > >
> > > The idea of letting a-z be case insensitive, but "everything else"
> > > be case sensitive, is a very strange idea.  When would the case
> > > mapping of A-Z (to lowercase) take place?  If it's done before
> > > normalisation (to normal form C or KC according to UTR 15), then
> > > any and arbitrary characters that are diacritised A-Z may or may
> > > not be mapped to lowercase.  That is because there may be letters
> > > there in decomposed form, e.g., A+<combining diaeresis> would be
> > > mapped to a+<combining diaeresis> and then normalised to ä,
> > > whereas a precomposed Ä would not be touched.
> > >
> > > Even saying that case mapping must be done after normalisation
> > > can produce strange effects.  Say that some orthography required
> > > the use of p with diaeresis.  There is no such precomposed letter.
> > > So even if an IDN with a P with diaeresis is mapped to lowercase
> > > after normalisation, then Ä would not be touched but
> > > P+<combining diaeresis> would be mapped to p+<combining diaeresis>.
> > > This kind of inconsistency I find unbearable.
> > >
> > > Unfortunately it seems we have to carry on with case
> > > insensitivity, for backwards compatibility reasons.  This means
> > > that case insensitivity has to be generalised to more than a-z, in
> > > some way that is an international compromise (since case mapping
> > > is in some parts dependent on orthography).  The only other option
> > > is to make IDNs case sensitive also for a-z.  That is an option I
> > > find preferable, since it greatly simplifies things, but I guess
> > > it's not acceptable for other reasons.
> > >
> > > A problem with current UTR 21 is the following:
> > >
> > > Say that we use normalisation form KC (which is reasonable; given
> > > that case insensitivity is desired, it would be strange to
> > > maintain all of the compatibility distinctions).
> > >
> > > If one first does normalisation to KC, and then "case fold", the
> > > result might not be in any normal form (D, C, KD, KC) at all.
> > > E.g. ß+<combining acute> (which is in NF KC) case folds to
> > > ss+<combining acute>, which is not in NF KC since there is a
> > > precomposed s with acute.
> > >
> > > If one instead first does "case folding", and then normalises to
> > > form KC, the result may contain uppercase letters that have a
> > > "case folding" to a lowercase letter.  E.g. BLACK-LETTER CAPITAL H
> > > has no mapping to lowercase, so case folding leaves it alone; but
> > > normalise to NF KC, and you will be left with a capital H (whereas
> > > other H-es will have been mapped to h).
> > >
> > > I think that first doing case folding together with compatibility
> > > decomposition, and then doing canonical composition (see UTR 15),
> > > would produce a reasonable and stable result that ignores both
> > > compatibility distinctions and case distinctions.  But UTR 21
> > > does not (yet) say that that's how to do it.
> > >
> > >                 Kind regards
> > >                 /kent k
>
>