Compatibility mapping KC (Re: [idn] My draft for internationalisation of DNS)
- To: "Martin J. Duerst" <duerst@w3.org>, Dan Oscarsson <Dan.Oscarsson@trab.se>
- Subject: Compatibility mapping KC (Re: [idn] My draft for internationalisation of DNS)
- From: Harald Tveit Alvestrand <Harald@Alvestrand.no>
- Date: Wed, 09 Feb 2000 13:53:03 +0100
- Cc: idn@ops.ietf.org
- Delivery-date: Wed, 09 Feb 2000 04:51:27 -0800
- Envelope-to: idn-data@psg.com
At 15:59 09.02.00 +0900, Martin J. Duerst wrote:
>At 15:47 00/02/08 +0100, Dan Oscarsson wrote:
>
> > Ok. The document on the differences between form C and KC was
> > not that easy to read. I thought the idea was to remove the look-alike
> > glyphs, among other things, but that is apparently wrong.
> >
> > So what do you think? Is it better to use form C, and remove
> > difficulties by excluding them from the repertoire?
>
>For the compatibility area, covered by KC, the work has to be done
>one by one. Some things are easy to eliminate by just forbidding some
>codepoints, others (smilies,...) might not be worth worrying
>one way or the other (although of course in the end we have
>to decide), others may need some detail work, e.g. a reference
>to KC (but I don't know any specific examples for these yet).
KC is a pain.
It means "decompose according to compatibility mappings, latest table
version, then compose according to canonical mappings, Unicode 3.0.0 version".
The decomposition thus includes:
- Mapping superscript 2 and circled 2 to the number 2 (same goes for squared
and circled Hangul)
- Mapping a singleton accent like 00A8;DIAERESIS to SPACE + COMBINING
DIAERESIS (oddly, not for the accents in the ASCII range....)
- Mapping the various spaces like NO-BREAK SPACE to SPACE
- Mapping Hangul compatibilities like 316B;HANGUL LETTER RIEUL-PIEUP-SIOS to
11D3;HANGUL JONGSEONG RIEUL-PIEUP-SIOS (anyone who understands what this
means - feel free to explain!)
- Mapping all Arabic positional forms (isolated, final, initial and medial)
to their non-position-specific cousins (like FED5;ARABIC LETTER QAF
ISOLATED FORM to 0642;ARABIC LETTER QAF)
- And of course my favourite Unicode mapping: Mapping FDFA;ARABIC LIGATURE
SALLALLAHOU ALAYHE WASALLAM to the short sequence 0635 0644 0649 0020 0627
0644 0644 0647 0020 0639 0644 064A 0647 0020 0648 0633 0644 0645 - that's
18 characters, including 3 spaces. I'm told this single character is
a shorthand for "may God bless him and grant him peace" or
something like that, and is written after the name of the Prophet
Muhammad.
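The mappings listed above can be checked with a short sketch in Python. It assumes the standard-library unicodedata module, where form KC is spelled "NFKC"; the helper names kc and codepoints and the SAMPLES table are made up for illustration. Python ships tables for a much newer Unicode version than 3.0.0, so this only illustrates the listed examples and says nothing about how the tables have changed since.

```python
import unicodedata

def kc(text: str) -> str:
    """Normalize to form KC (called NFKC in Python's unicodedata)."""
    return unicodedata.normalize("NFKC", text)

def codepoints(text: str) -> str:
    """Render a string as space-separated hex code points."""
    return " ".join(f"{ord(ch):04X}" for ch in text)

# Hypothetical sample table: one character per bullet in the list above.
SAMPLES = {
    "SUPERSCRIPT TWO": "\u00b2",
    "CIRCLED DIGIT TWO": "\u2461",
    "DIAERESIS": "\u00a8",
    "NO-BREAK SPACE": "\u00a0",
    "HANGUL LETTER RIEUL-PIEUP-SIOS": "\u316b",
    "ARABIC LETTER QAF ISOLATED FORM": "\ufed5",
    "ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM": "\ufdfa",
}

for label, ch in SAMPLES.items():
    # Form KC is "compatibility-decompose, then canonically compose";
    # doing the two steps by hand should give the same string.
    two_step = unicodedata.normalize("NFC", unicodedata.normalize("NFKD", ch))
    assert two_step == kc(ch)
    print(f"{label}: {codepoints(ch)} -> {codepoints(kc(ch))}")

expanded = kc("\ufdfa")
print(len(expanded), "characters,", expanded.count(" "), "of them spaces")
```

The loop prints each input code point next to its KC result, so the one-to-many expansions (DIAERESIS, the Arabic ligature) are easy to spot.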
KC may be good for names, because it matches what is otherwise nonmatchable.
But I don't like it for identifiers. And it's LOTS of mappings.
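A small sketch of that trade-off, again assuming Python's unicodedata module; same_under is a hypothetical helper, and the strings are chosen only for illustration. Under form C the compatibility variants stay distinct; under form KC they collapse, which helps when matching names but means two visibly different identifiers can become the same one.

```python
import unicodedata

def same_under(form: str, a: str, b: str) -> bool:
    """Compare two strings after normalizing both to the given form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

# Canonical equivalence: both forms agree that e + combining acute is e-acute.
print(same_under("NFC", "e\u0301", "\u00e9"))    # True
print(same_under("NFKC", "e\u0301", "\u00e9"))   # True

# Compatibility equivalence: only form KC folds the "fi" ligature
# and the superscript two into their plain counterparts.
print(same_under("NFC", "\ufb01le", "file"))     # False
print(same_under("NFKC", "\ufb01le", "file"))    # True
print(same_under("NFC", "x\u00b2", "x2"))        # False
print(same_under("NFKC", "x\u00b2", "x2"))       # True
```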
This, of course, is my interpretation of the current Unicode tables from
http://www.unicode.org/, not the Word on the matter.
But: Be careful what you ask for - you might get it.....
Harald
--
Harald Tveit Alvestrand, EDB Maxware, Norway
Harald.Alvestrand@edb.maxware.no