
RE: [idn] Some new ideas in my updated draft




Dan wrote (in his updated draft):
   Note: Normalisation form KC could have been possible to use instead
   of form C, but form KC is both much more complex to handle and
   does not preserve all semantics of the text. Form KC would make
   some character match equally, that will not do that in form C.
   Problems with different character representations can be fixed
   with a separate recommendation of what characters should be used
   in domain names.

It is not "much more complex": formally there is only one normalisation
algorithm, with four target normal forms (D, C, KD, KC).
Since downcasing (to get one form of caseless comparison) is to be
used here, I very strongly recommend using normalisation form KC.
The alternatives are to forbid certain compatibility characters (which
I see no real reason to do) or to treat compatibility variants as
different (which does not make much sense if case is not significant).
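
To make the point about compatibility variants concrete, here is a minimal
sketch in Python. It assumes the standard unicodedata module as a stand-in
for whatever normalisation code an implementation would use, and str.lower()
as a stand-in for plain downcasing. The helper names match_key_c and
match_key_kc are made up for illustration and do not come from the draft.

```python
import unicodedata

def match_key_c(label: str) -> str:
    """Hypothetical matching key: plain downcasing, then form C."""
    return unicodedata.normalize("NFC", label.lower())

def match_key_kc(label: str) -> str:
    """Hypothetical matching key: plain downcasing, then form KC."""
    return unicodedata.normalize("NFKC", label.lower())

plain = "fish"
ligature = "\ufb01sh"      # U+FB01 LATIN SMALL LIGATURE FI, then "sh"
fullwidth = "\uff26ISH"    # U+FF26 FULLWIDTH LATIN CAPITAL LETTER F, then "ISH"

for label in (plain, ligature, fullwidth):
    print(repr(label),
          match_key_c(label) == match_key_c(plain),    # equal under form C?
          match_key_kc(label) == match_key_kc(plain))  # equal under form KC?
```

With form C the ligature and fullwidth spellings stay distinct from "fish"
even though "FISH" and "fish" match; with form KC all of them collapse to the
same key.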

Regarding complexity, the easy bit is the 'decomposition' (canonical
or compatibility).  The hard bit is the 'composition', which may 'reach
out' past an intervening combining mark.  E.g. <a><cedilla><ring above>
normalises (C or KC) to <a with ring above><cedilla>.
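
The 'reach out' behaviour of composition can be checked with a small sketch,
again assuming Python's unicodedata module (which follows a much later
Unicode version than 3.0). The sequence used here is a, COMBINING CEDILLA,
COMBINING RING ABOVE. There is no precomposed a-with-cedilla, so composition
skips over the cedilla and combines the base letter with the ring.

```python
import unicodedata

# a + COMBINING CEDILLA (combining class 202) + COMBINING RING ABOVE (class 230)
seq = "a\u0327\u030a"

nfd = unicodedata.normalize("NFD", seq)
nfc = unicodedata.normalize("NFC", seq)

def names(s: str) -> list[str]:
    """Return the Unicode character name of each code point in s."""
    return [unicodedata.name(ch) for ch in s]

print(names(nfd))  # decomposed form: three separate code points
print(names(nfc))  # composed form: base and ring merged, cedilla left behind
```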

Dan continues:
   Note: Case folding to lower case using UTR#21 is not perfect. For
   example in Turkey I is lower cased into a dotless i, but UTR#21
   does it in the old ASCII way (I -> i). This way we get a well
   defined lower casing that can be used in matching, but it will
   not be correct for all local rules of different languages.
   The Turkish problem can be dealt with by asking users to
   only use a lower case dotless i, when needed.

I suggest ignoring UTR 21.  Just downcase according to the 'default'
(non-normative) lowercase mapping in the main property table for Unicode 3.0.
UTR 21 treats too much as equivalent, I think.  Using UTR 21 AND at the
same time considering compatibility variants to be different would be
very strange, I find.  In addition, UTR 21 folding can result in something
that is NOT in any normal form (i.e. none of D, C, KD, or KC).
So if one uses UTR 21 caselessness, normalisation must be done
AFTER that.  Plain downcasing does not result in such problems.
(Though plain uppercasing does...)
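
The ordering issue can be illustrated with a rough sketch. The assumptions
are loud ones: Python's str.casefold() applies the full case folding of
whatever Unicode version the interpreter ships, which is only an
approximation of the UTR 21 folding discussed here, and str.lower() stands
in for plain downcasing. The helper is_nfc is a made-up name.

```python
import unicodedata

def is_nfc(s: str) -> bool:
    """True if s is already in normalisation form C."""
    return unicodedata.normalize("NFC", s) == s

original = "\u01f0"            # LATIN SMALL LETTER J WITH CARON, already form C
folded = original.casefold()   # full case folding expands it to "j" + U+030C
lowered = original.lower()     # plain downcasing leaves it unchanged

print(is_nfc(original), is_nfc(folded), is_nfc(lowered))

# Normalising AFTER folding restores a normal form.
refolded = unicodedata.normalize("NFC", folded)
print(is_nfc(refolded))
```

In this example the folded string is no longer in form C until it is
normalised again, while the downcased string needs no second pass.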

                Kind regards
                /kent k