RE: Compatibility mapping KC (Re: [idn] My draft for internationalisation of DNS)



(See my comments inline below.)

> -----Original Message-----
> From: Harald Tveit Alvestrand [mailto:Harald@Alvestrand.no]
...
> At 15:59 09.02.00 +0900, Martin J. Duerst wrote:
> >At 15:47 00/02/08 +0100, Dan Oscarsson wrote:
> >
> > > Ok. The document on the differences between form C and KC was
> > > not that easy to read. I thought the idea was to remove the
> > > look-alike glyphs, among other things, but that is apparently wrong.
> > >
> > > So what do you think? Is it better to use form C, and remove
> > > difficulties by excluding them from the repertoire?
> >
> >For the compatibility area, covered by KC, it's a one-by-one
> >work. Some things are easy to eliminate by just forbidding some
> >codepoints, others (smilies,...) might not be worth worrying
> >one way or the other (although of course in the end we have
> >to decide), others may need some detail work, e.g. a reference
> >to KC (but I don't know any specific examples for these yet).
>
> KC is a pain.
> It means "decompose according to compatibility mappings, latest table
> version, then compose according to canonical mappings,
> Unicode 3.0.0 version".
>
> The decomposition thus includes:
>
> - Mapping superscript 2 and circled 2 to the number 2 (same goes for
>    squared and circled Hangul)

Yes, though not for the circled dingbat digits. I think that's an
oversight, but a) it's too late to change now, since these mappings will
not change, and b) who cares that much about these dingbats?

Squared thingies are just a typographic oddity.  I think there are
proposals to add (XHTML) markup to do that more generally, rather
than use special characters.
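
For anyone who wants to check this, here is a minimal sketch in Python.
It assumes the standard unicodedata module, whose tables are whatever
Unicode version the interpreter ships with, so not necessarily the
tables discussed here. The helper name nfkc is just mine.

```python
import unicodedata


def nfkc(s: str) -> str:
    """Apply normalisation form KC using the interpreter's Unicode tables."""
    return unicodedata.normalize("NFKC", s)


superscript_two = "\u00b2"  # SUPERSCRIPT TWO
circled_two = "\u2461"      # CIRCLED DIGIT TWO
dingbat_two = "\u2781"      # DINGBAT CIRCLED SANS-SERIF DIGIT TWO

# Print each character's name and the code points KC turns it into.
for ch in (superscript_two, circled_two, dingbat_two):
    print(unicodedata.name(ch), "->", [f"{ord(c):04X}" for c in nfkc(ch)])
```

The first two come out as a plain DIGIT TWO; the dingbat digit has no
compatibility mapping and comes out unchanged.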

> - Mapping a singleton accent like 00A8;DIAERESIS to SPACE + COMBINING
>    DIAERESIS (oddly, not for the accents in the ASCII range....)


Re: your parenthetical remark:

Because somebody asked (fairly recently, but more than half a year ago)
to have those mappings removed, and that was accepted.  Now they will
not change again. I think the reason was that KC should be applicable
to source text for computer programs and the like without upsetting
current syntax.
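
A small Python sketch of the difference, assuming the standard
unicodedata module (with the tables of whatever Unicode version the
interpreter bundles); the variable names are only for illustration.

```python
import unicodedata

diaeresis = "\u00a8"  # DIAERESIS, a spacing accent outside ASCII

# Compatibility decomposition: SPACE followed by COMBINING DIAERESIS.
decomposed = unicodedata.normalize("NFKD", diaeresis)
print([f"{ord(c):04X}" for c in decomposed])  # ['0020', '0308']

# The ASCII spacing accents have no such mapping, so KC leaves them alone.
ascii_accents = "^`"
unchanged = unicodedata.normalize("NFKC", ascii_accents) == ascii_accents
print(unchanged)  # True
```

So a circumflex or grave accent used as an operator or quote character
in program source survives KC as is.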

> - Mapping the various spaces like NO-BREAK SPACE to SPACE

Yes.
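
As a quick illustration (a sketch in Python, assuming the standard
unicodedata module; the three characters are just a sample I picked,
not a complete list of space-like characters):

```python
import unicodedata

# A few space-like characters that carry compatibility mappings to SPACE.
space_like = {
    "NO-BREAK SPACE": "\u00a0",
    "EN SPACE": "\u2002",
    "IDEOGRAPHIC SPACE": "\u3000",
}

# Normalise each one with KC and record the result.
normalised = {
    name: unicodedata.normalize("NFKC", ch) for name, ch in space_like.items()
}

for name, result in normalised.items():
    print(name, "->", [f"{ord(c):04X}" for c in result])
```

Each of them ends up as a plain U+0020.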

> - Mapping Hangul compatibilities like 316B;HANGUL LETTER RIEUL-PIEUP-SIOS
>    to 11D3;HANGUL JONGSEONG RIEUL-PIEUP-SIOS (anyone who understands
>    what this means - feel free to explain!)

I don't read Hangul.  But my understanding is that the "letter"
versions are used for things like IMEs, and the IME can
then map according to the compatibility mappings, or map more
or less directly to the syllable characters (for a typed in
sequence).  They are there also for compatibility with Korean
standards.

The Hangul Jamo (the conjoining alphabet) are intended
to be complete, even for historical Hangul.  The precomposed
syllables are there for speedy implementation and compatibility
with Korean standards.  The precomposed syllables cover only
modern Hangul.

As some of you may remember, there was a big fuss about
Korean back in 1994/1995.  Some argued that the ideal
situation would be to have only the Hangul Jamo.  These
are sufficient for Hangul.  But speed of implementation,
compatibility reasons, and political reasons resulted
in the current situation for Hangul.

By the way, there was a recent (about a year ago) change
to the Hangul compatibility mappings, in order to get
normalisation form KC to preserve Hangul syllables
in their precomposed form.
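
A rough sketch of how this looks from Python, assuming the standard
unicodedata module and whatever table version it bundles (so the raw
mapping printed for U+316B may or may not match the table version
discussed here). The syllable U+AC00 is just an example I picked.

```python
import unicodedata

compat_letter = "\u316b"  # HANGUL LETTER RIEUL-PIEUP-SIOS
# Raw decomposition field from the tables, e.g. "<compat> XXXX".
print(unicodedata.decomposition(compat_letter))

syllable = "\uac00"  # HANGUL SYLLABLE GA, a precomposed modern syllable

# Full decomposition gives the conjoining jamo ...
jamo = unicodedata.normalize("NFKD", syllable)
print([f"{ord(c):04X}" for c in jamo])

# ... and KC composes them back into the precomposed syllable.
recomposed = unicodedata.normalize("NFKC", jamo)
print(recomposed == syllable)
```

The point is that text already in precomposed syllables passes through
KC unchanged, and conjoining jamo for a modern syllable are composed
into that form.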


> - Mapping all Arabic final and medial forms to their non-position-specific
>    cousins (like FED5;ARABIC LETTER QAF ISOLATED FORM to
>    0642;ARABIC LETTER QAF)

Yes, indeed.  Next to no implementation uses the presentation
form variants.  There are indeed several character encodings
for Arabic that don't have them at all.  They are present only
as glyph variants in fonts, automatically selected based on
adjacent characters.

SOME are there for compatibility with IBM 'mainframe' encodings
(that should not leak out of the 'mainframe' environment,
neither the encoding, nor the presentation form letters).

MOST are there for political reasons only (request from the
Egypt NB).  Their presence is very upsetting to some because
they have caused so much trouble, and for nothing (since
nearly nobody is using them).
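
To see the folding in practice, a minimal Python sketch (assuming the
standard unicodedata module; QAF is simply the example letter from the
quoted text):

```python
import unicodedata

qaf = "\u0642"  # ARABIC LETTER QAF, the ordinary letter

# The four presentation forms of QAF: isolated, final, initial, medial.
presentation_forms = ["\ufed5", "\ufed6", "\ufed7", "\ufed8"]

# KC folds every presentation form back to the ordinary letter.
folded = [unicodedata.normalize("NFKC", ch) for ch in presentation_forms]

for ch, result in zip(presentation_forms, folded):
    print(unicodedata.name(ch), "->", f"{ord(result):04X}")
```

All four come out as 0642, which is what a normal implementation would
have stored in the first place.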


> - And of course my favourite Unicode mapping: Mapping
>    FDFA;ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM to the short
>    sequence 0635 0644 0649 0020 0627 0644 0644 0647 0020 0639 0644
>    064A 0647 0020 0648 0633 0644 0645 - that's 18 characters,
>    including 3 spaces......I'm told this single character is a
>    shorthand for "in the name of Allah, the benevolent and merciful"
>    or something like that, and is used in Arabic the same way English
>    letters start with "Dear Sir"......

Political reasons, IIUC...
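
The expansion is easy to verify with a few lines of Python (a sketch,
assuming the standard unicodedata module; the code point list is copied
from the quoted text above):

```python
import unicodedata

ligature = "\ufdfa"  # ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM

# Code points of the expansion, as listed in the quoted message.
quoted = (
    "0635 0644 0649 0020 0627 0644 0644 0647 0020 "
    "0639 0644 064A 0647 0020 0648 0633 0644 0645"
)
expected = "".join(chr(int(cp, 16)) for cp in quoted.split())

# KC replaces the single ligature character by the whole sequence.
expanded = unicodedata.normalize("NFKC", ligature)
print(len(expanded), expanded.count(" "))  # 18 3
print(expanded == expected)                # True
```

So one character in, 18 characters out, which matters to anyone who
sizes buffers or length limits before normalising.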


> KC may be good for names, because it matches what is otherwise
> nonmatchable.
> But I don't like it for identifiers. And it's LOTS of mappings.

KC is good for formal identifiers, I think.  It does remove
distinctions that should not be made for identifiers.
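
A sketch of what I mean, in Python, assuming the standard unicodedata
module. The function same_identifier is a hypothetical helper of mine,
not anything taken from a draft or specification.

```python
import unicodedata


def same_identifier(a: str, b: str) -> bool:
    """Compare two identifiers after normalising both to form KC."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)


# "file" written with the fi ligature (U+FB01) and with fullwidth letters.
with_ligature = "\ufb01le"
fullwidth = "\uff46\uff49\uff4c\uff45"

print(same_identifier(with_ligature, "file"))  # True
print(same_identifier(fullwidth, "file"))      # True

# KC does not fold case; that would be a separate step.
print(same_identifier("File", "file"))         # False
```

Typographic variants of the same letters compare equal, while genuinely
different strings (here, different case) stay distinct.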


                Kind regards
                /kent k


> This, of course, is my interpretation of the current Unicode tables
> from http://www.unicode.org/, not the Word on the matter.
>
> But: Be careful what you ask for - you might get it.....
>
>                                  Harald
>
> --
> Harald Tveit Alvestrand, EDB Maxware, Norway
> Harald.Alvestrand@edb.maxware.no
>
>