
Re: [idn] nameprep forbidden characters



There are pluses and minuses to both the inclusive approach and the
exclusive approach. If host names are to be more like programming language
identifiers, nameprep would be exclusive. Interestingly, on the side of
inclusiveness, probably the most significant change towards user
comprehension and naturalness would be to allow single interior spaces! That
may, however, be too radical a step.

I mentioned this previously, but you may not have seen it:
http://www.macchiato.com/unicode/IdentifierDiff.txt has a breakdown of
the differences between the recommended Unicode programming identifier
characters (UXI excludes compatibility composites, UFI includes
them), the XML general identifiers (XGI), and the nameprep allowed input. To
summarize:

# Characters in XGI start, but not in UXI: 19
# Characters in XGI extend, but not in UXI: 8
# Characters in UXI start (Unicode 2.1), but not in XGI: 528
# Characters in UXI extend (Unicode 2.1), but not in XGI: 8
# Characters in UXI start (Unicode 3.0), but not in XGI: 9,316
# Characters in UXI extend (Unicode 3.0), but not in XGI: 168
# Characters in both UXI and XGI, but in XGI start and UXI extend: 0
# Characters in both UXI and XGI, but in UXI start and XGI extend: 20
# Characters identical in both UXI and XGI: 35,075
# Characters in UFI but not UXI: 1,152
# Characters in UXI but not in nameprep: 56
# Characters in nameprep but not in UXI: 1,296
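
For readers who want to reproduce this kind of comparison, the counts above
can be read as plain set differences and intersections. The following is only
a minimal sketch under stated assumptions: the function name diff_summary, the
key names, and the toy character sets are all hypothetical, "extend" is taken
to mean "extend only" (disjoint from "start"), and nothing here is claimed to
be how IdentifierDiff.txt was actually generated.

```python
def diff_summary(a_start, a_extend, b_start, b_extend):
    """Count how two identifier definitions A and B differ.

    Each argument is a set of single characters. The "extend" sets are
    assumed to hold characters that may NOT start an identifier, so the
    start and extend sets of one definition do not overlap.
    """
    a_all = a_start | a_extend
    b_all = b_start | b_extend
    return {
        "a_start_not_in_b": len(a_start - b_all),
        "a_extend_not_in_b": len(a_extend - b_all),
        "b_start_not_in_a": len(b_start - a_all),
        "b_extend_not_in_a": len(b_extend - a_all),
        # In both definitions, but with different roles.
        "a_start_b_extend": len(a_start & b_extend),
        "b_start_a_extend": len(b_start & a_extend),
        # In both definitions with the same role.
        "identical": len((a_start & b_start) | (a_extend & b_extend)),
    }


# Toy data only; these are not the real UXI or XGI repertoires.
toy = diff_summary(
    a_start=set("abc_"), a_extend=set("012-"),
    b_start=set("abcd"), b_extend=set("0123_"),
)
```

With the toy sets, "_" is counted under a_start_b_extend because the two
definitions disagree about whether it may start an identifier.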

Both Unicode and XML distinguish between characters that can start an
identifier and characters that can only extend one, that is, appear after
the first position. The two definitions are very close: most of the
differences are due to characters added in Unicode 3.0. At some point those
will need to be allowed in XGI.
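
To make the start/extend distinction concrete, here is a small sketch that
uses Python's built-in str.isidentifier(). Note the assumptions: Python's
identifier rules follow the later XID_Start/XID_Continue properties (plus "_"
as a start character), so this is a stand-in for illustration, not the UXI or
XGI sets counted above, and the helper name classify is hypothetical.

```python
def classify(ch: str) -> str:
    """Classify one character as 'start', 'extend', or 'neither'.

    Uses Python's own identifier rules as a stand-in; results depend on
    the Unicode version shipped with the interpreter.
    """
    if len(ch) != 1:
        raise ValueError("expected a single character")
    if ch.isidentifier():
        # Valid as a one-character identifier, so it may start one.
        return "start"
    if ("a" + ch).isidentifier():
        # Valid only after a start character.
        return "extend"
    return "neither"
```

Under these rules a digit such as "9" classifies as "extend", a letter such as
"a" as "start", and a space or hyphen as "neither".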

Mark

"Adam M. Costello" wrote:

> I just looked at the nameprep draft for the first time.  The most
> striking feature (to me) is that, compared to existing host names, it
> takes a very different approach to specifying the allowable characters.
>
> For existing host names, the allowable categories are listed: letters,
> digits, hyphen.  Everything else is forbidden.  The nameprep draft, on
> the other hand, lists the forbidden characters (sometimes as categories,
> sometimes as enumerations), and everything else is allowed, except
> characters not yet assigned, which are forbidden.
>
> Isn't the explicit-allow approach safer?  If it is later decided that
> more characters should be allowed, they can be, but once the cat's out
> of the bag, you can't put it back in.  (Example:  The first letter of a
> host name label was originally required to be a letter, but later it was
> allowed to be a digit.)
>
> Many real English names use spaces, exclamation points, parentheses,
> periods, commas, etc, but we've been surviving just fine without them in
> host names.  In fact, having a restricted repertoire makes host names
> easier to remember, to guess, and to type.
>
> Is there some small list of UniData general categories that would be
> safe and allow a degree of flexibility in all languages analogous to
> what we have now for English?
>
> Example:  Japanese song titles often use wavy dashes, and sometimes use
> straight dashes.  Katakana words are sometimes separated by KATAKANA
> MIDDLE DOT (U+30FB), but often the dots are omitted.  Perhaps host names
> should avoid all punctuation in all languages so people don't have to
> worry about it.
>
> AMC