
Re: [idn] First report from IDN nameprep design team



At 2:30 PM +1100 12/9/00, Frank Ernens wrote:
>Paul Hoffman / IMC wrote:
>
>>  At 10:23 AM +1100 12/8/00, Frank Ernens wrote:
>>  >Agree "prohibit" step should be last, but disagree with the map 
>>  >-> normalize
>>  >order. If it differs at all from normalize -> map, it means that you
>>  >must be imbuing some compatibility characters ("edge cases"?) with
>>  >special meaning.
>>
>>  It doesn't mean that at all; this order has nothing to do with
>>  compatibility characters.
>
>"Normalize" means NFKC, I assume.

Yup.

>Let M be your mapping step. Then if the order matters, there exists
>some string s for which M (NFKC (s)) != NFKC (M (s)), and also for
>which M (compat (s)) != compat (M (s)), since the alternative
>of M (NFC (s)) != NFC (M (s)) isn't allowed by conformance
>requirement C9 of Unicode ("A process shall not assume that the
>interpretations of two canonical-equivalent character sequences
>are distinct.").

That is all true, but it still doesn't lead to us imbuing 
compatibility characters with special meaning. None of the mappings 
that we envision right now have anything to do with compatibility 
characters, only with edge cases. You are possibly thinking that we 
still intend to prohibit compatibility characters on input because 
the -00 draft did, but we explicitly said otherwise in the report. We 
want an input method that puts out compatibility characters instead 
of the "real" characters to have those characters accepted rather 
than rejected.

>BTW, the tables are much smaller if you use M (NFKC ()), not that
>memory footprint matters much these days.

We are mapping edge cases before normalization instead of after so 
that we don't introduce an accidental de-normalization by mapping 
afterwards. And I'm not convinced that the tables of edge cases would 
be much smaller.
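The de-normalization risk can be sketched in a few lines of Python. The mapping table below is purely illustrative (it is not a real nameprep mapping); it just folds one character to a decomposed sequence to show why mapping after normalization can leave a string that is no longer in NFKC:

```python
import unicodedata

# Hypothetical edge-case mapping for illustration only:
# fold a precomposed character to a decomposed sequence.
EDGE_MAP = {"\u00C5": "A\u030A"}  # Å -> A + combining ring above

def apply_map(s):
    return "".join(EDGE_MAP.get(ch, ch) for ch in s)

s = "\u00C5"

# Map first, then normalize: the result is guaranteed to be in NFKC.
good = unicodedata.normalize("NFKC", apply_map(s))

# Normalize first, then map: the mapping can de-normalize the string.
bad = apply_map(unicodedata.normalize("NFKC", s))

print(unicodedata.is_normalized("NFKC", good))  # True
print(unicodedata.is_normalized("NFKC", bad))   # False
```

(unicodedata.is_normalized needs Python 3.8 or later.)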

>Maybe it's better that I wait until I see the revised draft!

Probably, but it is still good for folks like you to be looking at 
what we are proposing before we do the draft. That way, we don't go 
down a path that has serious problems in it.


>>  >>  3) So far, the mapping step in nameprep only maps uppercase
>>  >>     characters to lowercase. The compatibility normalization step does
>>  >>     the work of converting compatibility characters into their normal
>>  >>     forms, but there are other sets of characters that the input
>>  >>     mechanisms on users' systems might enter that can be mapped to other
>>  >>     characters. For example, there are many different hyphen characters
>>  >>     (such as U+00AD, soft hyphen) that do not get normalized but can all
>>  >>     be mapped into the single hyphen character that is already allowed by
>>  >>     STD 13.
>>  >
>>  >I agree with this proposal in general but not with the example you
>>  >give.
>>
>>  Can you say why? In what kind of name would soft hyphen be
>>  appropriate where hyphen would not?
>
>If a domain name contains a soft hyphen, and is typeset onto paper
>such that no line break occurs, the soft hyphen is invisible and
>a human entering the name from the piece of paper won't include
>a hyphen. If there is a line break, most people will try it with
>and without the hyphen.

Exactly right. That is why it is a bad idea to allow soft hyphens 
in host names, and why we can either prohibit them on output of 
nameprep or, to be more friendly to users whose input methods put out 
soft hyphens instead of regular ones, convert them to regular hyphens.
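The friendlier of the two options above amounts to a one-character fold; a minimal sketch (the function name is mine, not from any draft):

```python
# Fold soft hyphens (U+00AD) to the ordinary hyphen-minus (U+002D)
# instead of rejecting labels that contain them.
SOFT_HYPHEN = "\u00AD"
HYPHEN = "\u002D"

def fold_soft_hyphens(label: str) -> str:
    return label.replace(SOFT_HYPHEN, HYPHEN)

print(fold_soft_hyphens("ex\u00ADample"))  # ex-ample
```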

>I didn't want to quibble about examples
>here.

Better here than other places!

>>  >A good reason not to use that new order. If there should be an
>>  >error in preparing this more complex mapping table, it could require
>>  >special-case hacks forever.
>>
>>  Not at all true. Exactly because it is a table, there are no "hacks"
>>  needed: implementors simply take the table from the document and use
>>  it directly. If there are errors (and obviously everyone will be
>>  looking hard to prevent them), they simply are left there forever, no
>>  hacks needed.
>
>But will client software need hacks to work around them?

No: client software will use the table directly. Just to be clear, 
the intent of the design team would be that the document has:
- mapping, done by a table in the specification
- normalization, done by a process that conforms to UTR 15
- prohibition, done by a table in the specification
There are at least two pieces of open-source software that do the 
normalization, and other companies have written their own.
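The three-step structure described above can be sketched as follows. The two tables here are hypothetical stand-ins (the real ones would come from the specification), and label.lower() stands in for the case-folding part of the mapping table:

```python
import unicodedata

# Illustrative stand-ins for the tables in the specification.
MAP_TABLE = {"\u00AD": "-"}        # e.g. soft hyphen -> hyphen
PROHIBITED = {"\uFFFE", "\uFFFF"}  # e.g. non-character codepoints

def nameprep_sketch(label: str) -> str:
    # 1. Mapping, driven by a table in the specification
    #    (lower() stands in for the uppercase-to-lowercase mapping).
    mapped = "".join(MAP_TABLE.get(ch, ch) for ch in label.lower())
    # 2. Normalization, per UTR 15 (NFKC).
    normalized = unicodedata.normalize("NFKC", mapped)
    # 3. Prohibition, driven by a table in the specification.
    for ch in normalized:
        if ch in PROHIBITED:
            raise ValueError(f"prohibited codepoint U+{ord(ch):04X}")
    return normalized

print(nameprep_sketch("Ex\u00ADample"))  # ex-ample
```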

>>  >>  5) Non-character codepoints will be listed as prohibited characters.
>>  >
>>  >Except that maybe non-authoritative caches should pass everything
>>  >through so that only the client and authoritative server need to
>>  >be brought up to the latest software/Unicode level to use new
>>  >characters.
>>
>>  Non-character codepoints will never turn into characters; they are
>>  already assigned, but as non-characters.
>
>We're talking at cross-purposes; I took "non-characters" to mean
>non-assigned.

An easy mistake. In TUS version 3, see page 327 for a description of 
the two in the BMP.

>>
>>  >>  6) The question of where to do name preparation will be removed from
>>  >>     this document, but must be addressed in the eventual IDN protocol
>>  >>     document.
>>  >
>>  >It has a bearing on the order, so you may need to address or
>>  >assume it. For example, if clients do one kind of mapping and
>>  >the resolver another, that determines the order.
>>
>>  There is only one kind of mapping specified. If a protocol requires
>>  clients and resolvers to both map (and that would be a pretty lame
>>  protocol), they would be doing exactly the same mapping and therefore
>>  getting the same result.
>
>The first half of the mapping (e.g. compatibility folding) might be done
>in the client and the second half (everything else) in the resolver. I
>believe the recent posters from the Japanese NIC were thinking along
>those lines.

Ah, you mean splitting nameprep into different locations in the 
protocol. Well, that is possible, but it still shouldn't be specified 
in the nameprep document. The protocol document can say which parts 
of nameprep are done where. Personally, I think splitting them is a 
very bad idea, since it makes the system harder to debug. The recent 
discussion from Yoneya-san and others was based on the nameprep-00 
document and I hope that all of their fears (particularly about 
compatibility characters) are allayed in this proposal from the 
design team.

>>  >** z-Variants: what happened to that question?
>>
>>  There is a WG document on that; see
>>  <http://www.i-d-n.net/draft/draft-ietf-idn-cjk-00.txt>.
>
>There was a thread here arising from that draft but it petered out.
>When we left it someone had pointed out that no more z-variants
>are being created - the ones that are there were a result
>of that wretched round-trip rule.
>
>If they are implemented in the current nameprep design, from each set of
>equivalent z-variants we will have to choose a preferred character
>to map to. That might involve protracted argument and therefore
>the sooner it's started the better. Alternatives are to not implement
>them and (ugh) to match instead of map. Z-variant folding would be
>part of your mapping M.

...or not done at all. The i18n community has strong disagreements about:

- should they even be mapped
- if they should be mapped, which characters are *really* z variants
- if they should be mapped, what type of end character to map them 
to: a Japanese-based character, a Chinese-based character, or a 
Korean-based character

The IETF is clearly not the group to do this work.

--Paul Hoffman, Director
--Internet Mail Consortium