Re: [idn] First report from IDN nameprep design team



At 10:23 AM +1100 12/8/00, Frank Ernens wrote:
>Agree "prohibit" step should be last, but disagree with the map -> normalize
>order. If it differs at all from normalize -> map, it means that you
>must be imbuing some compatibility characters ("edge cases"?) with
>special meaning.

It doesn't mean that at all; this order has nothing to do with 
compatibility characters.

>  I think this is a bad idea. They should be left as
>those characters that are there by historical accident, which will rust
>away over the decades.

Nothing in the new ordering affects the rustiness of compatibility characters.

>In particular, it should be redundantly harmless for a client application
>or environment to map compatibility characters before ever handing
>a string to the resolver.

Fully agree; fortunately, it is.
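
As a rough illustration of that point, the Python sketch below checks, for a couple of sample strings only, that replacing compatibility characters before preparation gives the same result as preparing the raw input. The `prepare` and `client_premap` functions are hypothetical stand-ins, not the nameprep algorithm: the mapping step is approximated with `str.casefold()` and normalization with NFKC from `unicodedata`, using whatever Unicode version the interpreter ships.

```python
import unicodedata

def prepare(name: str) -> str:
    """Hypothetical stand-in for name preparation: map, then normalize."""
    mapped = name.casefold()                      # mapping step (assumed: case folding only)
    return unicodedata.normalize("NFKC", mapped)  # compatibility normalization

def client_premap(name: str) -> str:
    """What a client might do on its own: replace compatibility characters."""
    return unicodedata.normalize("NFKC", name)

samples = [
    "\uFF25\uFF38\uFF21\uFF2D\uFF30\uFF2C\uFF25\uFF0D\uFF11",  # fullwidth "EXAMPLE-1"
    "Example",
]

for s in samples:
    # Pre-mapping by the client should not change the prepared form.
    assert prepare(client_premap(s)) == prepare(s)
```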

>If the "edge-case" problem you are thinking of is the FULLWIDTH HYPHUS->
>MINUS one Yoshira Yoneya raised recently, then I don't think it's
>valid: (i) my copies of the mapping tables don't have that mapping,
>but rather map it to hyphus (0x2d) as expected (ii) any such special
>case mapping should be consistent across all compatibility variants
>anyway. Can we have some examples of the edge cases you have in mind?

Sure. Here are the first two of the new mappings that are there to 
make the casing correct:

037A; 0020 03B9 # GREEK YPOGEGRAMMENI
03D2; 03C5 # GREEK UPSILON WITH HOOK SYMBOL
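
To see why entries like these matter, here is a small Python sketch comparing the two orders on exactly those two characters. It is an approximation under stated assumptions: `str.casefold()` stands in for the case-mapping step, NFKC comes from Python's `unicodedata` (a newer Unicode version than the one under discussion), and `SPECIAL_MAP` holds only the two entries quoted above. The function names are made up for illustration.

```python
import unicodedata

# The two table entries quoted above (source character -> replacement).
SPECIAL_MAP = {
    "\u037A": "\u0020\u03B9",  # GREEK YPOGEGRAMMENI
    "\u03D2": "\u03C5",        # GREEK UPSILON WITH HOOK SYMBOL
}

def map_step(s: str, table: dict) -> str:
    # Table entries win; everything else gets plain case folding (assumed).
    return "".join(table.get(ch, ch.casefold()) for ch in s)

def normalize_step(s: str) -> str:
    return unicodedata.normalize("NFKC", s)

def map_then_normalize(s: str, table: dict) -> str:
    return normalize_step(map_step(s, table))

def normalize_then_map(s: str) -> str:
    return map_step(normalize_step(s), {})

for ch in SPECIAL_MAP:
    # Without the extra entries, the two orders disagree on these characters.
    assert map_then_normalize(ch, {}) != normalize_then_map(ch)
    # With them, map -> normalize gives the same result as normalize -> map.
    assert map_then_normalize(ch, SPECIAL_MAP) == normalize_then_map(ch)
```

In both cases normalization produces a character that still needs case mapping (U+0345 and U+03A5 respectively), which is what the extra table entries compensate for when mapping runs first.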


>> 3) So far, the mapping step in nameprep only maps uppercase
>>     characters to lowercase. The compatibility normalization step does
>>     the work of converting compatibility characters into their normal
>>     forms, but there are other sets of characters that the input
>>     mechanisms on users' systems might enter that can be mapped to other
>>     characters. For example, there are many different hyphen characters
>>     (such as U+00AD, soft hyphen) that do not get normalized but can all
>>     be mapped into the single hyphen character that is already allowed by
>>     STD 13.
>
>I agree with this proposal in general but not with the example you
>give.

Can you say why? In what kind of name would soft hyphen be 
appropriate where hyphen would not?
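
For concreteness, a minimal Python sketch of what such a hyphen mapping could look like. The table contents are an assumption: only U+00AD is named in the proposal, U+2010 is added here as a plausible candidate, and the actual list has not been decided. `HYPHEN_MAP` and `map_hyphens` are illustrative names.

```python
import unicodedata

HYPHEN = "\u002D"  # HYPHEN-MINUS, the hyphen already allowed by STD 13

# Hypothetical mapping entries; only U+00AD comes from the proposal text.
HYPHEN_MAP = {
    "\u00AD": HYPHEN,  # SOFT HYPHEN
    "\u2010": HYPHEN,  # HYPHEN (assumed candidate)
}

def map_hyphens(name: str) -> str:
    """Replace every hyphen-like character listed in HYPHEN_MAP with U+002D."""
    return "".join(HYPHEN_MAP.get(ch, ch) for ch in name)

# Compatibility normalization alone leaves these characters untouched,
# which is why the mapping step has to handle them.
for ch in HYPHEN_MAP:
    assert unicodedata.normalize("NFKC", ch) == ch

print(map_hyphens("foo\u00ADbar"))
```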

>
>>     Also, with the new order suggested above, there are some
>>     special cases for case-mapping that need to be added so that all
>>     characters case-map as expected.
>
>A good reason not to use that new order. If there should be an
>error in preparing this more complex mapping table, it could require
>special-case hacks forever.

Not at all true. Precisely because it is a table, no "hacks" are
needed: implementors simply take the table from the document and use
it directly. If there are errors (and obviously everyone will be
looking hard to prevent them), they simply stay in the table forever;
no hacks needed.
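
A minimal sketch of "take the table and use it directly", in Python. It assumes the table is published as lines in the `source; replacement # comment` form of the two entries quoted earlier in this message; the function names and the choice to pass unlisted characters through unchanged are illustrative assumptions.

```python
def parse_mapping_table(text: str) -> dict:
    """Parse lines like '037A; 0020 03B9 # NAME' into {source: replacement}."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not line:
            continue
        source, target = line.split(";", 1)
        replacement = "".join(chr(int(cp, 16)) for cp in target.split())
        table[chr(int(source.strip(), 16))] = replacement
    return table

def apply_table(name: str, table: dict) -> str:
    """Map each character through the table; unlisted characters pass through."""
    return "".join(table.get(ch, ch) for ch in name)

TABLE_TEXT = """\
037A; 0020 03B9 # GREEK YPOGEGRAMMENI
03D2; 03C5 # GREEK UPSILON WITH HOOK SYMBOL
"""

TABLE = parse_mapping_table(TABLE_TEXT)
```

The implementation has no per-character logic of its own: correcting or extending the mapping means changing the table text, not the code.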

>> 5) Non-character codepoints will be listed as prohibited characters.
>
>Except that maybe non-authoritative caches should pass everything
>through so that only the client and authoritative server need to
>be brought up to the latest software/Unicode level to use new
>characters.

Non-character codepoints will never turn into characters; they are
already assigned, but as non-characters.
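
Because that set is fixed, a prohibition check for it can be written once. The Python sketch below uses the noncharacter set as defined in the current Unicode Standard (U+FDD0..U+FDEF plus the last two code points of every plane), which is an assumption relative to the Unicode version under discussion here; `is_noncharacter` and `find_prohibited` are illustrative names.

```python
def is_noncharacter(cp: int) -> bool:
    """True if cp is a Unicode noncharacter (current Unicode definition)."""
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # U+nFFFE and U+nFFFF at the end of each of the 17 planes.
    return 0 <= cp <= 0x10FFFF and (cp & 0xFFFE) == 0xFFFE

def find_prohibited(name: str) -> list:
    """Return the noncharacters found in name, in order of appearance."""
    return [ch for ch in name if is_noncharacter(ord(ch))]
```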

>> 6) The question of where to do name preparation will be removed from
>>     this document, but must be addressed in the eventual IDN protocol
>>     document.
>
>It has a bearing on the order, so you may need to address or
>assume it. For example, if clients do one kind of mapping and
>the resolver another, that determines the order.

There is only one kind of mapping specified. If a protocol requires 
clients and resolvers to both map (and that would be a pretty lame 
protocol), they would be doing exactly the same mapping and therefore 
getting the same result.

>** Bidirectional controls: I would hope these are going to be
>prohibited rather than thrown away, so that names and also FQDNs
>contain none. The "mathematical" alef to gimel, which differ from
>the normal Hebrew only in being left to right, must be disallowed.

The lists of characters that are prohibited and the characters that 
are thrown away have not been decided because the design team wanted 
to be sure that there was consensus on the current proposals before 
making those lists. FWIW, I fully agree with you on those choices.

>** z-Variants: what happened to that question?

There is a WG document on that; see 
<http://www.i-d-n.net/draft/draft-ietf-idn-cjk-00.txt>.

--Paul Hoffman, Director
--Internet Mail Consortium