
Compatibility mapping KC (Re: [idn] My draft for internationalisation of DNS)



At 15:59 09.02.00 +0900, Martin J. Duerst wrote:
>At 15:47 00/02/08 +0100, Dan Oscarsson wrote:
>
> > Ok. The document on the differences between form C and KC was
> > not that easy to read. I thought the idea was to remove the look-alike
> > glyphs, among other things, but that is apparently wrong.
> >
> > So what do you think? Is it better to use form C, and remove
> > difficulties by excluding them from the repertoire?
>
>For the compatibility area, covered by KC, it's case-by-case
>work. Some things are easy to eliminate by just forbidding some
>codepoints, others (smileys,...) might not be worth worrying about
>one way or the other (although of course in the end we have
>to decide), others may need some detail work, e.g. a reference
>to KC (but I don't know any specific examples for these yet).

KC is a pain.
It means "decompose according to compatibility mappings, latest table 
version, then compose according to canonical mappings, Unicode 3.0.0 version".
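To make the two steps concrete, here is a minimal sketch, assuming Python's
standard unicodedata module. Note that it uses whatever Unicode version the
interpreter ships with, not the table versions named above, and the helper
name kc() is made up for illustration.

```python
import unicodedata

def kc(text: str) -> str:
    """Form KC spelled out as its two steps (hypothetical helper)."""
    # Step 1: decompose according to the compatibility mappings.
    decomposed = unicodedata.normalize("NFKD", text)
    # Step 2: compose according to the canonical mappings.
    return unicodedata.normalize("NFC", decomposed)

# "e" + COMBINING ACUTE ACCENT + SUPERSCRIPT TWO
sample = "e\u0301\u00b2"
result = kc(sample)

# The library's one-shot form KC should agree with the two-step version.
one_shot = unicodedata.normalize("NFKC", sample)
print(repr(result), result == one_shot)
```

The superscript is folded to a plain digit by step 1, and the accent is
recombined with its base letter by step 2.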

The decomposition thus includes:

- Mapping superscript 2 and circled 2 to the number 2 (same goes for squared
   and circled Hangul)
- Mapping a standalone (spacing) accent like 00A8;DIAERESIS to SPACE +
   COMBINING DIAERESIS (oddly, not for the accents in the ASCII range....)
- Mapping the various spaces like NO-BREAK SPACE to SPACE
- Mapping Hangul compatibilities like 316B;HANGUL LETTER RIEUL-PIEUP-SIOS to
   11D3;HANGUL JONGSEONG RIEUL-PIEUP-SIOS (anyone who understands what this
   means - feel free to explain!)
- Mapping all Arabic presentation forms (isolated, final, initial and medial)
   to their non-position-specific cousins (like FED5;ARABIC LETTER QAF
   ISOLATED FORM to 0642;ARABIC LETTER QAF)
- And of course my favourite Unicode mapping: Mapping FDFA;ARABIC LIGATURE
   SALLALLAHOU ALAYHE WASALLAM to the short sequence 0635 0644 0649 0020 0627
   0644 0644 0647 0020 0639 0644 064A 0647 0020 0648 0633 0644 0645 - that's
   18 characters, including 3 spaces......I'm told this single character is
   a shorthand for the blessing "may God bless him and grant him peace" or
   something like that, and is written after the name of the Prophet......
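The mappings listed above can be checked with a short sketch, again assuming
Python's unicodedata module (current Unicode tables, so not necessarily
identical to the 3.0.0 data). The dictionary of examples is only an
illustration built from the code points mentioned in the list.

```python
import unicodedata

# Code points taken from the list above, keyed by a short description.
examples = {
    "superscript two": "\u00b2",
    "circled digit two": "\u2461",
    "diaeresis": "\u00a8",
    "no-break space": "\u00a0",
    "hangul letter rieul-pieup-sios": "\u316b",
    "arabic qaf isolated form": "\ufed5",
    "arabic ligature FDFA": "\ufdfa",
    "ascii circumflex": "^",
}

# Apply form KC to each one and keep the result.
results = {name: unicodedata.normalize("NFKC", ch) for name, ch in examples.items()}

for name, mapped in results.items():
    # Show the result as a list of hex code points.
    print(f"{name}: {' '.join(f'{ord(c):04X}' for c in mapped)}")

# Form C, by contrast, leaves every one of these characters alone.
unchanged_by_c = all(unicodedata.normalize("NFC", ch) == ch for ch in examples.values())
```

Running it lists the KC result for each character as hex code points, and
unchanged_by_c records whether form C touched any of them.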

KC may be good for names, because it matches what is otherwise nonmatchable.
But I don't like it for identifiers. And it's LOTS of mappings.
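
As a rough illustration of why this worries me for identifiers, the sketch
below groups labels that become identical under KC. It assumes Python's
unicodedata module; the function collisions() and the sample labels are
invented for the example and are not from any draft.

```python
import unicodedata
from collections import defaultdict

def collisions(labels):
    """Group labels that collapse to the same string under form KC."""
    groups = defaultdict(list)
    for label in labels:
        groups[unicodedata.normalize("NFKC", label)].append(label)
    # Keep only the KC forms that more than one distinct label maps to.
    return {folded: originals for folded, originals in groups.items() if len(originals) > 1}

# Made-up labels: superscript two, circled two, and the "fi" ligature.
labels = ["x2", "x\u00b2", "x\u2461", "\ufb01le", "file", "plain"]
clashes = collisions(labels)

for folded, originals in clashes.items():
    print(folded, "<-", originals)
```

Whether such a collapse is a feature (matching names) or a bug (merging
distinct identifiers) is exactly the judgment call.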

This, of course, is my interpretation of the current Unicode tables from
http://www.unicode.org/, not the Word on the matter.

But: Be careful what you ask for - you might get it.....

                                 Harald

--
Harald Tveit Alvestrand, EDB Maxware, Norway
Harald.Alvestrand@edb.maxware.no