[idn] First report from IDN nameprep design team



To the IDN WG:

The IDN nameprep design team has been studying the nameprep document,
and we propose the following changes. We are not finished with our
work, but want to report our progress and hear input from the WG. Of
course, this will be discussed heavily in San Diego next week, and a
new version of the nameprep draft can be made ready before the end of
December on the points for which there is general agreement.

1) It is difficult and probably not useful to try to prohibit
   characters that might cause confusion because they look like other
   characters or because they might be accidentally entered by users.
   Therefore, the next list of prohibited characters will be
   significantly smaller. For example, compatibility characters (which
   are common for Arabic and Asian scripts) would be allowed on input.

2) The order of the steps for nameprep will be changed from
     prohibit -> fold -> normalize
   to
     map -> normalize -> prohibit

   This new order has many advantages. It allows many more characters to
   be input to the nameprep process without returning errors because
   those characters will get converted by the normalization step into
   allowed characters. It also allows the mapping step to fix edge-case
   problems before they get to the normalization step, as described in
   the next point.

3) So far, the mapping step in nameprep only maps uppercase
   characters to lowercase. The compatibility normalization step does
   the work of converting compatibility characters into their normal
   forms, but there are other sets of characters that the input
   mechanisms on users' systems might enter that can be mapped to other
   characters. For example, there are many different hyphen characters
   (such as U+00AD, soft hyphen) that do not get normalized but can all
   be mapped into the single hyphen character that is already allowed by
   STD 13. Also, with the new order suggested above, there are some
   special cases for case-mapping that need to be added so that all
   characters case-map as expected. Some characters might be mapped to
   nothing, meaning that they will simply be ignored on input; for
   example, some of the non-displaying characters that are currently
   prohibited might be mapped out of the input stream rather than
   causing an error. The mapping step will be specified as a single
   table of mappings so that implementors don't have to create the table
   themselves from disparate sources.

4) Case-folding derived from the Unicode data table alone does not
   handle all cases of folding. The mechanism for mapping to lowercase
   will instead be derived from the CaseFolding.txt file. (See UTR 21
   from the Unicode Consortium for more details.)

5) Non-character codepoints will be listed as prohibited characters.

6) The question of where to do name preparation will be removed from
   this document, but must be addressed in the eventual IDN protocol
   document.

7) The word "canonicalize" will be changed to "normalize".
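
As a rough illustration of points 2 through 5, the proposed order
(map -> normalize -> prohibit) might be sketched in Python as below.
This is only a sketch under stated assumptions, not the nameprep
algorithm itself: MAP_TABLE is a hypothetical three-entry stand-in for
the single mapping table the draft would specify, Python's
str.casefold() stands in for the mappings derived from CaseFolding.txt,
NFKC is assumed as the compatibility normalization form, and the only
prohibition checked is non-character code points. The special cases for
case-mapping mentioned in point 3 are not handled.

```python
import unicodedata

# Hypothetical mapping table for illustration only; the real table
# would be specified in the nameprep draft.
MAP_TABLE = {
    "\u00ad": "-",  # soft hyphen -> hyphen-minus (example from point 3)
    "\u2010": "-",  # HYPHEN -> hyphen-minus (assumed entry)
    "\u200b": "",   # zero width space -> mapped to nothing (assumed entry)
}


def is_noncharacter(cp: int) -> bool:
    """True for Unicode non-character code points:
    U+FDD0..U+FDEF and the last two code points of every plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def nameprep_sketch(label: str) -> str:
    # Step 1: map. Table entries win; everything else is case-folded.
    mapped = "".join(MAP_TABLE.get(ch, ch.casefold()) for ch in label)
    # Step 2: normalize (compatibility normalization, assumed NFKC).
    normalized = unicodedata.normalize("NFKC", mapped)
    # Step 3: prohibit. Only non-characters are checked in this sketch.
    for ch in normalized:
        if is_noncharacter(ord(ch)):
            raise ValueError(f"prohibited code point U+{ord(ch):04X}")
    return normalized
```

Because prohibition runs last, a compatibility character such as a
fullwidth letter is converted by the first two steps instead of being
rejected, which is the advantage described in point 2.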