Re: [idn] First report from IDN nameprep design team



James Seng/Personal wrote:

> 1) It is difficult and probably not useful to try to prohibit
>    characters that might cause confusion because they look like other
>    characters or because they might be accidentally entered by users.

Yes, agree with this.

> 2) The order of the steps for nameprep will be changed from
>      prohibit -> fold -> normalize
>    to
>      map -> normalize -> prohibit
> 
>    This new order has many advantages. It allows many more characters to
>    be input to the nameprep process without returning errors because
>    those characters will get converted by the normalization step into
>    allowed characters. It also allows the mapping step to fix edge-case
>    problems before they get to the normalization step, as described in
>    the next point.

I agree that the "prohibit" step should be last, but I disagree with the
map -> normalize order. If it differs at all from normalize -> map, you
must be imbuing some compatibility characters ("edge cases"?) with
special meaning. I think this is a bad idea. They should be left as
characters that are there by historical accident and that will rust
away over the decades.

In particular, it should be harmless, if redundant, for a client
application or environment to map compatibility characters before ever
handing a string to the resolver. It may do so for presentation reasons
or as part of some larger application scheme involving a transformation
Unicode -> rich text -> plain text. I also don't imagine many new
environments making entry of these characters easy.
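
To make the order dependence concrete, here is a rough sketch in Python.
It is only an illustration under stated assumptions: plain str.lower()
stands in for the nameprep mapping step and NFKC for the normalization
step (neither is the draft's actual table), and U+2102 DOUBLE-STRUCK
CAPITAL C is a sample compatibility character picked for the
demonstration. The helper names are hypothetical.

```python
import unicodedata


def map_then_normalize(s: str) -> str:
    # Assumed stand-in for "map -> normalize": lowercase, then NFKC.
    return unicodedata.normalize("NFKC", s.lower())


def normalize_then_map(s: str) -> str:
    # Assumed stand-in for "normalize -> map": NFKC, then lowercase.
    return unicodedata.normalize("NFKC", s).lower()


def client_premap(s: str) -> str:
    # A client that folds compatibility characters before calling the resolver.
    return unicodedata.normalize("NFKC", s)


sample = "\u2102"  # DOUBLE-STRUCK CAPITAL C, which has no lowercase mapping

for name, prep in (("map -> normalize", map_then_normalize),
                   ("normalize -> map", normalize_then_map)):
    direct = prep(sample)
    via_client = prep(client_premap(sample))
    # Report whether pre-mapping on the client changes the final result.
    print(name, ascii(direct), ascii(via_client), direct == via_client)
```

With these stand-ins, map -> normalize yields "C" for the raw input but
"c" once the client has pre-mapped it, while normalize -> map yields "c"
either way. Making the first order agree requires an extra special-case
entry for every such character.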

If the "edge-case" problem you are thinking of is the FULLWIDTH
HYPHEN-MINUS -> MINUS one that Yoshiro Yoneya raised recently, then I
don't think it's valid: (i) my copies of the mapping tables don't have
that mapping, but rather map it to HYPHEN-MINUS (0x2d) as expected, and
(ii) any such special-case mapping should be consistent across all
compatibility variants anyway. Can we have some examples of the edge
cases you have in mind?
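
For anyone who wants to check a table quickly, the following sketch
queries the Unicode data bundled with Python. That data reflects
whatever Unicode version the interpreter ships, not necessarily the
tables under discussion here, so treat it as a sanity check only.
U+FE63 SMALL HYPHEN-MINUS is included as a second compatibility variant
chosen for illustration.

```python
import unicodedata

# Compatibility variants of HYPHEN-MINUS (U+002D) to inspect.
variants = {
    "FULLWIDTH HYPHEN-MINUS": "\uff0d",
    "SMALL HYPHEN-MINUS": "\ufe63",
}

for name, ch in variants.items():
    # The raw decomposition field from the character database.
    decomp = unicodedata.decomposition(ch)
    # The result of compatibility normalization.
    folded = unicodedata.normalize("NFKC", ch)
    print(f"{name}: decomposition={decomp!r} NFKC=U+{ord(folded):04X}")
```

Both variants should come out as U+002D, which is the consistency
across compatibility variants referred to in (ii).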

> 3) So far, the mapping step in nameprep only maps uppercase
>    characters to lowercase. The compatibility normalization step does
>    the work of converting compatibility characters into their normal
>    forms, but there are other sets of characters that the input
>    mechanisms on users' systems might enter that can be mapped to other
>    characters. For example, there are many different hyphen characters
>    (such as U+00AD, soft hyphen) that do not get normalized but can all
>    be mapped into the single hyphen character that is already allowed by
>    STD 13.

I agree with this proposal in general but not with the example you
give.

>    Also, with the new order suggested above, there are some
>    special cases for case-mapping that need to be added so that all
>    characters case-map as expected.

A good reason not to use that new order. If there should be an
error in preparing this more complex mapping table, it could require
special-case hacks forever.

> 5) Non-character codepoints will be listed as prohibited characters.

Except that maybe non-authoritative caches should pass everything
through so that only the client and authoritative server need to
be brought up to the latest software/Unicode level to use new
characters.
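
For reference, a non-character test is small enough to sketch. This
assumes the set is U+FDD0..U+FDEF plus the last two code points of each
plane; the function name is hypothetical and nothing here is taken from
the draft. Under the suggestion above, such a check would run at the
client and the authoritative server, not in caches.

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if code point cp is a Unicode non-character (assumed set)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # The contiguous block U+FDD0..U+FDEF.
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # U+nFFFE and U+nFFFF at the end of every plane.
    return (cp & 0xFFFE) == 0xFFFE


def has_noncharacter(name: str) -> bool:
    # True if any character of the name is a non-character.
    return any(is_noncharacter(ord(ch)) for ch in name)
```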

> 6) The question of where to do name preparation will be removed from
>    this document, but must be addressed in the eventual IDN protocol
>    document.

It has a bearing on the order, so you may need to address or
assume it. For example, if clients do one kind of mapping and
the resolver another, that determines the order.

** Bidirectional controls: I would hope these are going to be
prohibited rather than thrown away, so that names and also FQDNs
contain none. The "mathematical" alef to gimel (U+2135 to U+2137), which
differ from the normal Hebrew letters only in being left to right, must
be disallowed.
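
A minimal sketch of "prohibit rather than throw away", assuming the
controls in question are U+200E, U+200F and U+202A..U+202E. That list
and the function name are my assumptions, not the draft's. The last
lines print the directional class of ALEF SYMBOL next to that of the
ordinary Hebrew alef, using the Unicode data bundled with Python.

```python
import unicodedata

# Assumed set of bidirectional controls: LRM, RLM, LRE, RLE, PDF, LRO, RLO.
BIDI_CONTROLS = frozenset("\u200e\u200f\u202a\u202b\u202c\u202d\u202e")


def reject_bidi_controls(name: str) -> str:
    # Raise an error instead of silently dropping the control character.
    for ch in name:
        if ch in BIDI_CONTROLS:
            raise ValueError(f"prohibited bidi control U+{ord(ch):04X}")
    return name


alef_symbol = "\u2135"  # ALEF SYMBOL, the "mathematical" alef
hebrew_alef = "\u05d0"  # HEBREW LETTER ALEF

# Directional classes: left-to-right symbol versus right-to-left letter.
print(unicodedata.bidirectional(alef_symbol), unicodedata.bidirectional(hebrew_alef))
# Whether compatibility normalization folds the symbol onto the letter.
print(unicodedata.normalize("NFKC", alef_symbol) == hebrew_alef)
```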

** z-Variants: what happened to that question?