
Re: [idn] nameprep forbidden characters



AMC noted:

> For existing host names, the allowable categories are listed: letters,
> digits, hyphen.  Everything else is forbidden.  The nameprep draft, on
> the other hand, lists the forbidden characters (sometimes as categories,
> sometimes as enumerations), and everything else is allowed, except
> characters not yet assigned, which are forbidden.
> 
> Isn't the explicit-allow approach safer? 

Technically, the two approaches are identical, since the repertoire is
defined against Unicode 3.0 and characters not yet assigned are
forbidden.

If, from a domain U, you decide to forbid the set X of characters, that
is the same as explicitly allowing the (U - X) set of characters.
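To make the equivalence concrete, here is a minimal sketch. The toy
universe U and forbidden set X below are made-up stand-ins, not the
actual nameprep tables; the point is only that a deny-list check over
a closed universe and an allow-list check over (U - X) accept exactly
the same labels.

```python
import string

# Toy universe standing in for "all assigned characters" (assumption:
# the real U would be the assigned repertoire of Unicode 3.0).
U = set(string.ascii_lowercase + string.digits + "-_.! ")

# Hypothetical forbidden set, chosen only for illustration.
X = set("_.! ")

# The explicit allow-list is just the complement of X within U.
ALLOWED = U - X


def valid_by_deny_list(label):
    """Accept a label if every character is assigned and not forbidden."""
    return all(ch in U and ch not in X for ch in label)


def valid_by_allow_list(label):
    """Accept a label if every character is on the explicit allow-list."""
    return all(ch in ALLOWED for ch in label)


if __name__ == "__main__":
    for label in ["example-1", "bad name", "a_b", "caf\u00e9"]:
        print(repr(label), valid_by_deny_list(label), valid_by_allow_list(label))
```

The two functions only agree because unassigned characters (anything
outside U) are rejected by the deny-list version as well.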

It is just easier to explain, on a case-by-case or class-by-class
basis, why certain characters *must* be excluded in name preparation.
That is easier to justify than starting with a more pared-down list
and then trying to explain to people why their otherwise valid and
useful characters didn't make the cut.

> Many real English names use spaces, exclamation points, parentheses,
> periods, commas, etc, but we've been surviving just fine without them in
> host names.  In fact, having a restricted repertoire makes host names
> easier to remember, to guess, and to type.

That is not really the issue. Most currently forbidden punctuation
will have to stay forbidden, since it clashes with other syntax
usages. But a host name in Mongolian isn't going to be easy for an
English speaker to remember, guess, or type -- though it could well be
for a Mongolian speaker (which *would* be the point). There is very
little reason to leave out extra characters such as the NARROW
NO-BREAK SPACE that help Mongolian legibility simply by analogy to
the pared-down name space we've gotten used to for English host and
domain names -- unless an equivalencing resulting from normalization
would cause a problem for that character.

> 
> Is there some small list of UniData general categories that would be
> safe and allow a degree of flexibility in all languages analogous to
> what we have now for English?

Basically, no. The guidelines for identifier syntax on p. 135 of
The Unicode Standard, Version 3.0 (TUS 3.0) are the closest thing to
that. But once you introduce Normalization Form KC, you need to
consider exclusions of characters very carefully, to avoid
equivalencing problems.
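As a rough illustration of the kind of equivalencing meant here, the
sketch below uses Python's standard unicodedata module to show what
Normalization Form KC does to a few characters. Note the assumptions:
Python ships data for a much later Unicode version than 3.0, and the
sample characters are my own picks, not a list from the nameprep
draft.

```python
import unicodedata


def nfkc_report(ch):
    """Return (NFKC form of ch, whether normalization changed it)."""
    folded = unicodedata.normalize("NFKC", ch)
    return folded, folded != ch


if __name__ == "__main__":
    samples = [
        "\u202F",  # NARROW NO-BREAK SPACE
        "\uFF21",  # FULLWIDTH LATIN CAPITAL LETTER A
        "\uFB01",  # LATIN SMALL LIGATURE FI
        "a",       # plain ASCII letter, unaffected
    ]
    for ch in samples:
        folded, changed = nfkc_report(ch)
        codes = " ".join("U+%04X" % ord(c) for c in folded)
        print("U+%04X %s -> %s (changed: %s)"
              % (ord(ch), unicodedata.name(ch), codes, changed))
```

NARROW NO-BREAK SPACE folds to an ordinary SPACE under NFKC, so a
character that looks harmless on its own can turn into one that is
forbidden for syntax reasons once normalization has been applied.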

You cannot just scale up ASCII letters + digits easily for application
to all the scripts of the world.

> Example:  Japanese song titles often use wavy dashes, and sometimes use
> straight dashes.  Katakana words are sometimes separated by KATAKANA
> MIDDLE DOT (U+30FB), but often the dots are omitted.  Perhaps host names
> should avoid all punctuation in all languages so people don't have to
> worry about it.

Generally, yes, but in detail no. And this just raises the question of
exactly how to define "all punctuation in all languages". Just one
obvious example: "&". This gets treated as punctuation by most computer
libraries, but it is historically derived as a manuscript abbreviation
for Latin "et"; it gets wordlike usage, is pronounced as "and", forms
parts of "words", such as "&c.", and so on. There are many animals
like this in writing systems around the world, where the line between
punctuation and letter is not easy to draw, and where there may not
be any particularly good *technical* reason to tell people that their
character is forbidden in a host or domain name.
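For what it is worth, here is a small sketch of what a blanket
"forbid all punctuation" rule looks like if it is defined purely by
Unicode general category. The is_punctuation helper is a hypothetical
filter of my own, and Python's unicodedata reflects a later Unicode
version than 3.0, so individual category values may differ from the
3.0 data.

```python
import unicodedata


def is_punctuation(ch):
    """True if the general category of ch is one of the P* classes."""
    return unicodedata.category(ch).startswith("P")


if __name__ == "__main__":
    samples = [
        "&",       # AMPERSAND
        "-",       # HYPHEN-MINUS, allowed in host names today
        "\u301C",  # WAVE DASH
        "\u30FB",  # KATAKANA MIDDLE DOT
        "a",       # LATIN SMALL LETTER A
    ]
    for ch in samples:
        print("U+%04X %-24s %s punctuation=%s"
              % (ord(ch), unicodedata.name(ch),
                 unicodedata.category(ch), is_punctuation(ch)))
```

A filter like this lumps "&" in with the dashes and dots, and it would
also throw out the hyphen that existing host names rely on, which is
why a category-only rule does not settle the question.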

--Ken Whistler
