
Re: [idn] First report from IDN nameprep design team



Paul Hoffman / IMC wrote:

> At 10:23 AM +1100 12/8/00, Frank Ernens wrote:
> >Agree "prohibit" step should be last, but disagree with the map -> normalize
> >order. If it differs at all from normalize -> map, it means that you
> >must be imbuing some compatibility characters ("edge cases"?) with
> >special meaning.
> 
> It doesn't mean that at all; this order has nothing to do with
> compatibility characters.

"Normalize" means NFKC, I assume.

Let M be your mapping step. If the order matters, there exists some
string s for which M (NFKC (s)) != NFKC (M (s)). For that s it must
also be that M (compat (s)) != compat (M (s)), since the alternative
of M (NFC (s)) != NFC (M (s)) isn't allowed by conformance
requirement C9 of Unicode ("A process shall not assume that the
interpretations of two canonical-equivalent character sequences
are distinct.").
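
To make the claim concrete, here is a minimal Python sketch. It assumes
M can be approximated by Python's str.casefold(), which is only a
stand-in for the nameprep mapping step and not the draft's table, and
it uses whatever Unicode version ships with the interpreter rather
than Unicode 3.0. For U+03D2 the order matters under NFKC but not
under NFC, which is the sense in which the difference has to come from
the compatibility part:

```python
import unicodedata

def M(s: str) -> str:
    # Stand-in for the mapping step (assumption): full Unicode case folding.
    return s.casefold()

def commutes(form: str, s: str) -> bool:
    """True if mapping and normalizing give the same result in either order."""
    def norm(t: str) -> str:
        return unicodedata.normalize(form, t)
    return M(norm(s)) == norm(M(s))

s = "\u03d2"  # GREEK UPSILON WITH HOOK SYMBOL
print("NFKC commutes with M:", commutes("NFKC", s))
print("NFC  commutes with M:", commutes("NFC", s))
```

This checks a single string only; it says nothing about whether a
given M commutes with NFC in general.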

BTW, the tables are much smaller if you use M (NFKC ()), not that
memory footprint matters much these days.

> >[fge]
> >If the "edge-case" problem you are thinking of is the FULLWIDTH
> >HYPHEN-MINUS -> MINUS one Yoshiro Yoneya raised recently, then I don't
> >think it's valid: (i) my copies of the mapping tables don't have that
> >mapping, but rather map it to hyphen-minus (0x2d) as expected; (ii) any
> >such special
> >case mapping should be consistent across all compatibility variants
> [fge] Can we have some examples of the edge cases you have in mind?
> 
> Sure. Here are the first two of the new mappings that are there to
> make the casing correct:
> 
> 037A; 0020 03B9 # GREEK YPOGEGRAMMENI
> 03D2; 03C5 # GREEK UPSILON WITH HOOK SYMBOL

  i. The first is the iota subscript, which lower cases to a following iota
  ("iota adscript"). For some reason the mapping is missing from the
  Unicode tables. Maybe because it's really language-dependent; the
  *upper* case mapping can't be done without knowing it's ancient rather
  than modern Greek. By giving it a mapping, you are assuming (rightly,
  I guess) that the presence of iota subscript implies ancient Greek.
  We've already had to choose some languages over others when we found
  we couldn't handle Turkish properly, so this is reasonable.
    So this kind of "edge mapping" consists of those for all languages
  we prefer that aren't in the Unicode tables, right?

  ii. The second is the way some old fashioned people write an upper
  case Upsilon in ancient Greek; they write the lower case form normally.
  If Unicode are going to be consistent, they should regard this as a font
  variant (as they do with Coptic) and make it a compatibility equivalent to
  0x3a5 (normal Upsilon). That mapping, absent in Unicode 2.x, has indeed
  appeared in Unicode 3.0. I suspect the case mapping hasn't been updated
  and what you've found is an error in the Unicode standard.

  iii. Maybe some other "edge cases" are Unicode's SpecialCasing.txt,
  which includes all those cases where the length of the string changes?
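
One way to hunt for candidate "edge cases" mechanically, sketched in
Python. The assumptions here are mine: case folding is taken to be
str.casefold(), the scan is limited to the Greek block, and the
Unicode data is whatever the interpreter ships with, so the output
need not match the draft's table. The sketch lists every code point
where normalize-then-fold differs from fold-then-normalize, in the
same "source; target" layout as the two entries quoted above:

```python
import unicodedata

def nfkc(s: str) -> str:
    return unicodedata.normalize("NFKC", s)

def edge_cases(start: int, end: int) -> dict[int, str]:
    """Code points in [start, end] where the two orders disagree.

    Maps each such code point to casefold(NFKC(c)), i.e. the value a
    map-first table would need to carry explicitly.
    """
    found = {}
    for cp in range(start, end + 1):
        ch = chr(cp)
        normalize_then_fold = nfkc(ch).casefold()
        fold_then_normalize = nfkc(ch.casefold())
        if normalize_then_fold != fold_then_normalize:
            found[cp] = normalize_then_fold
    return found

greek = edge_cases(0x0370, 0x03FF)
for cp, mapped in greek.items():
    target = " ".join(f"{ord(c):04X}" for c in mapped)
    print(f"{cp:04X}; {target} # {unicodedata.name(chr(cp), '?')}")
```

Under these assumptions the scan includes both 037A and 03D2 with the
same targets as in the quoted entries.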
 
Maybe it's better that I wait until I see the revised draft!
     
> >  > 3) So far, the mapping step in nameprep only maps uppercase
> >>     characters to lowercase. The compatibility normalization step does
> >>     the work of converting compatibility characters into their normal
> >>     forms, but there are other sets of characters that the input
> >>     mechanisms on users' systems might enter that can be mapped to other
> >>     characters. For example, there are many different hyphen characters
> >>     (such as U+00AD, soft hyphen) that do not get normalized but can all
> >>     be mapped into the single hyphen character that is already allowed by
> >>     STD 13.
> >
> >I agree with this proposal in general but not with the example you
> >give.
> 
> Can you say why? In what kind of name would soft hyphen be
> appropriate where hyphen would not?

If a domain name contains a soft hyphen, and is typeset onto paper
such that no line break occurs, the soft hyphen is invisible and
a human entering the name from the piece of paper won't include
a hyphen. If there is a line break, most people will try it with
and without the hyphen. I didn't want to quibble about examples
here.
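
For what it's worth, the difference can be shown in a few lines of
Python. The two mapping functions and the sample name are hypothetical
and not taken from the draft; the sketch only illustrates that NFKC
leaves U+00AD alone, and that mapping it to a hyphen gives a different
name from the one a person reading it off paper would type:

```python
import unicodedata

SOFT_HYPHEN = "\u00ad"

def map_to_hyphen(name: str) -> str:
    # Hypothetical mapping: fold soft hyphen into HYPHEN-MINUS (U+002D).
    return name.replace(SOFT_HYPHEN, "-")

def map_to_nothing(name: str) -> str:
    # Hypothetical alternative: drop the soft hyphen entirely.
    return name.replace(SOFT_HYPHEN, "")

registered = "co" + SOFT_HYPHEN + "operate"  # made-up label with a soft hyphen
typed = "cooperate"                          # what is seen when no line break occurs

print(unicodedata.normalize("NFKC", SOFT_HYPHEN) == SOFT_HYPHEN)
print(map_to_hyphen(registered) == typed)
print(map_to_nothing(registered) == typed)
```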

> >
> >>     Also, with the new order suggested above, there are some
> >>     special cases for case-mapping that need to be added so that all
> >>     characters case-map as expected.
> >
> >A good reason not to use that new order. If there should be an
> >error in preparing this more complex mapping table, it could require
> >special-case hacks forever.
> 
> Not at all true. Exactly because it is a table, there are no "hacks"
> needed: implementors simply take the table from the document and use
> it directly. If there are errors (and obviously everyone will be
> looking hard to prevent them), they simply are left there forever, no
> hacks needed.

But will client software need hacks to work around them?

> >  > 5) Non-character codepoints will be listed as prohibited characters.
> >
> >Except that maybe non-authoritative caches should pass everything
> >through so that only the client and authoritative server need to
> >be brought up to the latest software/Unicode level to use new
> >characters.
> 
> Non-character codepoints will never turn into codepoints; they are
> already assigned, but as non-characters.

We're talking at cross-purposes; I took "non-characters" to mean
non-assigned.

> 
> >  > 6) The question of where to do name preparation will be removed from
> >>     this document, but must be addressed in the eventual IDN protocol
> >>     document.
> >
> >It has a bearing on the order, so you may need to address or
> >assume it. For example, if clients do one kind of mapping and
> >the resolver another, that determines the order.
> 
> There is only one kind of mapping specified. If a protocol requires
> clients and resolvers to both map (and that would be a pretty lame
> protocol), they would be doing exactly the same mapping and therefore
> getting the same result.

The first half of the mapping (e.g. compatibility folding) might be done
in the client and the second half (everything else) in the resolver. I
believe the recent posters from the Japanese NIC were thinking along
those lines.

> >** z-Variants: what happened to that question?
> 
> There is a WG document on that; see
> <http://www.i-d-n.net/draft/draft-ietf-idn-cjk-00.txt>.

There was a thread here arising from that draft but it petered out.
When we left it someone had pointed out that no more z-variants
are being created - the ones that are there were a result
of that wretched round-trip rule.

If they are implemented in the current nameprep design, from each set of
equivalent z-variants we will have to choose a preferred character
to map to. That might involve protracted argument and therefore
the sooner it's started the better. Alternatives are to not implement
them and (ugh) to match instead of map. Z-variant folding would be
part of your mapping M.
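
A rough Python sketch of the two alternatives, map versus match. The
variant set below is a single illustrative pair (U+8AAA and U+8AAC,
which I believe are commonly cited as z-variants) and is an assumption,
not data from the draft; picking the lowest code point as the preferred
character is likewise arbitrary and stands in for whatever choice
would eventually be agreed:

```python
# Illustrative variant sets only; a real table would need an agreed source.
Z_VARIANT_SETS = [frozenset("\u8aaa\u8aac")]

def build_map_table(variant_sets):
    """Map every member of each set to one preferred character."""
    table = {}
    for group in variant_sets:
        preferred = min(group)  # arbitrary choice: lowest code point
        for ch in group:
            table[ord(ch)] = preferred
    return table

def fold(name: str, table) -> str:
    # "Map": rewrite the name once, then compare folded names exactly.
    return name.translate(table)

def matches(a: str, b: str, variant_sets) -> bool:
    # "Match": leave names as they are and compare modulo variant sets.
    if len(a) != len(b):
        return False
    def same(x: str, y: str) -> bool:
        return x == y or any(x in g and y in g for g in variant_sets)
    return all(same(x, y) for x, y in zip(a, b))

table = build_map_table(Z_VARIANT_SETS)
a, b = "\u8aaa.example", "\u8aac.example"
print(fold(a, table) == fold(b, table))
print(matches(a, b, Z_VARIANT_SETS))
```

With mapping, the preferred character has to be chosen up front;
with matching, no choice is needed but every comparison has to know
about the variant sets.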