
Need for Normalization forms "KR" was: Re: [idn] case folding



At 09:32 AM 6/17/00 -0800, Mark Davis wrote:
>My view is that NFKC is generally appropriate for cases where identifiers 
>are case-insensitive, but otherwise reasonable people may disagree with me ;-)

The issue with the 'K' forms of the Normalization is twofold:

1) the set of compatibility mappings in Unicode 3.0 has 16 different
    sub-types, reflecting a wide variety of relations between characters and
    their 'compatibility equivalents'. Because of this wide range, it's
    harder for implementers to understand the consequences of applying
    forms K, compared to, say, case folding.

2) some sub-types of compatibility mappings appear consistent in Version 3.0,
    but will look screwy when taking into account the imminent extensions.
    The existing characters for mathematical variables would be folded, but
    the characters to be added would not. Black Letter H would be, but Fraktur
    D would not.
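
To make the sub-types concrete, here is a minimal sketch that reads the compatibility tag off a character's decomposition mapping. It assumes Python's standard unicodedata module, whose data reflects a later Unicode version than 3.0, so results may differ from the situation described above. The helper name compat_tag is made up for illustration. Note that the data file spells the no-break tag <noBreak>.

```python
import unicodedata

def compat_tag(ch):
    """Return the compatibility tag of ch (e.g. '<font>'), or None.

    Canonical decompositions carry no tag, so they also return None.
    """
    decomp = unicodedata.decomposition(ch)
    if decomp.startswith("<"):
        return decomp.split(" ", 1)[0]
    return None

# A few characters from the sub-types discussed in this message.
samples = [
    "\u210C",  # BLACK-LETTER CAPITAL H
    "\uFF21",  # FULLWIDTH LATIN CAPITAL LETTER A
    "\u00A0",  # NO-BREAK SPACE
    "\uFE8E",  # ARABIC LETTER ALEF FINAL FORM
    "\u00BD",  # VULGAR FRACTION ONE HALF
    "\u00B2",  # SUPERSCRIPT TWO
]

for ch in samples:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {compat_tag(ch)}")
```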

However, there are some sub-types of compatibility mappings for which 
Mark's oft-repeated "they are just formatting differences" would be quite 
valid (half-width/full-width and no-break come to mind).

There are additional sub-types that have 'loss-less' compatibility 
mappings, and therefore are best folded (I like to think of these as 'near 
canonical equivalents'). I'm of course referring to the initial/medial/
final/isolated Arabic letter variants. One could argue that the <fraction> 
mappings belong here as well.

The correct approach then would be to suggest the use of a different 
normalization form, one that makes exceptions for some of the more problematic 
sub-types of compatibility mappings. I like to call this form "KR" for 
"Kompatibility with Restraint".

I'm not sure whether we can fix the existing forms K. I understand that the 
*canonical* form C has been endorsed by the W3C and needs therefore to 
adhere to the stability guarantee that was made at the time. I am not aware 
that any such external normative reference to forms K exists. However, nothing 
prevents UTC from doing the right thing: defining forms KR, if necessary as 
new normalization forms, and ceasing to endorse or recommend the 
problematic forms K in their existing blanket form.

Specifically:

Forms KR would include these compatibility sub-types:

<initial>
<medial>
<final>
<isolated>
<no-break>
<narrow>
<wide>
<vertical>
<small>
<square>
<fraction>

Forms KR would exclude these compatibility sub-types:
<font>
<super>
<sub>
<circle> (*) see footnote
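
The include/exclude lists above could be sketched roughly as follows. This is only an illustration under stated assumptions, not a specification of forms KR: the names KR_TAGS, kr_decompose and normalize_kr are hypothetical, the <compat> sub-type is left unfolded pending the case-by-case analysis, and the tag spelling <noBreak> is the one used in Python's unicodedata, whose data is newer than Unicode 3.0.

```python
import unicodedata

# Sub-types that this sketch folds. Everything else (<font>, <super>,
# <sub>, <circle>, and the unanalyzed <compat>) is left untouched.
KR_TAGS = {
    "<initial>", "<medial>", "<final>", "<isolated>",
    "<noBreak>", "<narrow>", "<wide>", "<vertical>",
    "<small>", "<square>", "<fraction>",
}

def kr_decompose(ch):
    """Apply a compatibility mapping only if its tag is in KR_TAGS."""
    decomp = unicodedata.decomposition(ch)
    if not decomp.startswith("<"):
        return ch  # no mapping, or a canonical one (left to NFD/NFC)
    tag, *codes = decomp.split()
    if tag not in KR_TAGS:
        return ch
    # Mappings can nest, so fold the result recursively.
    return "".join(kr_decompose(chr(int(c, 16))) for c in codes)

def normalize_kr(text):
    """Restrained analogue of NFKC: decompose, fold selectively, recompose."""
    decomposed = unicodedata.normalize("NFD", text)
    folded = "".join(kr_decompose(c) for c in decomposed)
    return unicodedata.normalize("NFC", folded)

# Fullwidth A is folded; superscript two is kept, unlike under NFKC.
print(normalize_kr("\uFF21\u00B2"))
print(unicodedata.normalize("NFKC", "\uFF21\u00B2"))
```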

The <compat> sub-type, being the 'grab-bag' of characters
with compatibility relations that are not further
specified, and in some cases even questionable (U+2107), would need to be 
analyzed once, in a case-by-case approach. Some examples:

Roman Numerals: KR
Parenthesized: KR
CJK and Radicals compats: KR

Dotted Alphanumerics: probably KR
Ligatures: probably KR
Telegraph symbols: probably KR

Euler Constant: not-KR
Alef Symbol, etc.: not-KR

Spacing accents (mapped to SP + combining accents): ??

etc, etc.
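
For anyone who wants to look at the grab-bag directly, a small sketch along these lines prints the <compat> mapping of one representative character from several of the groups above. It assumes Python's unicodedata (newer data than Unicode 3.0). The helper name describe and the choice of representatives are mine, not part of any standard.

```python
import unicodedata

def describe(ch):
    """Return (tag, mapped string) for a character with a tagged mapping."""
    tag, *codes = unicodedata.decomposition(ch).split()
    target = "".join(chr(int(c, 16)) for c in codes)
    return tag, target

# One representative per group; all carry the generic <compat> tag.
examples = [
    ("\u2160", "Roman numeral"),   # ROMAN NUMERAL ONE
    ("\u2474", "parenthesized"),   # PARENTHESIZED DIGIT ONE
    ("\uFB01", "ligature"),        # LATIN SMALL LIGATURE FI
    ("\u2107", "Euler constant"),  # EULER CONSTANT
    ("\u2135", "Alef symbol"),     # ALEF SYMBOL
    ("\u00A8", "spacing accent"),  # DIAERESIS
]

for ch, group in examples:
    tag, target = describe(ch)
    codes = " ".join(f"U+{ord(c):04X}" for c in target)
    print(f"{group}: U+{ord(ch):04X} {tag} -> {codes}")
```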

A./

(*) I thought about this one for some time. Dropping the circle, i.e. 
mapping (20) to 20 as forms K do, can cause the suddenly 'bare' numbers 
or letters to coalesce with adjacent words or numbers. That would be truly 
counterintuitive to the user and is therefore best avoided. This issue 
does not apply to the parenthesized composites.
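
A quick way to see the coalescing effect, assuming Python's unicodedata and its NFKC implementation; the input strings are made-up examples:

```python
import unicodedata

circled = "1\u2473"        # digit one followed by CIRCLED NUMBER TWENTY
parenthesized = "1\u2487"  # digit one followed by PARENTHESIZED NUMBER TWENTY

# The circle is dropped, so the bare "20" runs into the preceding digit.
print(unicodedata.normalize("NFKC", circled))        # 120

# The parentheses survive as ordinary characters and keep the numbers apart.
print(unicodedata.normalize("NFKC", parenthesized))  # 1(20)
```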