
Re: [idn] nameprep forbidden characters



Kenneth Whistler <kenw@sybase.com> wrote:

> the "disallow unless specify otherwise" approach is less clear in
> terms of justifying and explaining why characters need to be omitted
> from the allowable set as part of name preparation.

Paul Hoffman / IMC <phoffman@imc.org> wrote:

> It is much easier to describe why certain characters are disallowed
> than to say why they are allowed.

I acknowledge the truth of both statements.  My main reason for
preferring the explicit-allow approach is that people are fallible,
including people who write specs.  Characters are more likely to be
accidentally left off a list than to be accidentally included.  With
explicit-allow, these accidents are easily corrected, because disallowed
characters can later be allowed (that's what IDN is doing on a grand
scale).  But once a character is allowed, you can't take it back.
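
The asymmetry can be sketched in a few lines of Python.  This is only an
illustration: the two character sets and the sample names are hypothetical
and are not taken from any nameprep draft.

```python
# Hypothetical allowed sets, for illustration only.
ALLOWED_V1 = set("abcdefghijklmnopqrstuvwxyz0123456789-")
# A later revision fixes an accidental omission by allowing one more character.
ALLOWED_V2 = ALLOWED_V1 | {"\u00e9"}  # e with acute accent, an arbitrary pick


def is_valid(name, allowed):
    """True if every character of name is in the allowed set."""
    return all(ch in allowed for ch in name)


def still_valid_after_change(names, old, new):
    """True if every name that was valid under old is also valid under new."""
    return all(is_valid(n, new) for n in names if is_valid(n, old))


names = ["example", "caf\u00e9", "my-host"]

# Widening the set (V1 -> V2) cannot invalidate a name that already exists.
widening_ok = still_valid_after_change(names, ALLOWED_V1, ALLOWED_V2)

# Narrowing the set (V2 -> V1) invalidates any name using the removed character.
narrowing_ok = still_valid_after_change(names, ALLOWED_V2, ALLOWED_V1)
```

Here widening_ok comes out true and narrowing_ok comes out false, which is
the "can't take it back" problem in miniature.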

Also notice that the existing rules allowing only ASCII letters, digits,
and hyphen give no justification for why each and every other ASCII
character was forbidden.  Yet I think that set of characters is a very
good one for English host names, because:

  * All characters in the set are obviously safe in almost all contexts.

  * The set is large enough to be adequately expressive.

  * The set is small enough and simple enough that the identifiers
    are easy to remember.  When I want to recall the domain name for
    AT&T, I don't need to strain to remember whether "&" is allowed or
    not.  I never need to strain to remember which hosts decided to use
    underscores and which hosts decided to use hyphens.

I think the non-English communities should think twice before departing
from those principles.
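
For concreteness, the letters-digits-hyphen rule can be checked with a very
small function.  This is a minimal sketch: the name is_ldh is made up here,
and it tests only the character set described above, ignoring any rules
about label length or where a hyphen may appear.

```python
import string

# ASCII letters, digits, and hyphen: the set discussed above.
LDH = set(string.ascii_letters + string.digits + "-")


def is_ldh(label):
    """True if label is non-empty and uses only ASCII letters, digits, hyphen."""
    return len(label) > 0 and all(ch in LDH for ch in label)


candidates = ["att", "at&t", "my-host", "my_host"]
accepted = [c for c in candidates if is_ldh(c)]
```

With these inputs, accepted holds "att" and "my-host"; the ampersand and
underscore forms are rejected, so there is nothing to remember about them.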

Dave Crocker <dhc@dcrocker.net> wrote:

> Does this mean that Patrik Falstrom will not be able to represent his
> middle name?  It has a colon (:) in it.

Randy Bush <randy@psg.com> wrote:

> i suspect that the definition of 'punctuation' is variable. e.g. i
> know what i think of as a colon is a valid character in swedish names.

This is just like the apostrophe in O'Brien.

We're not talking about human names; we're talking about host names and
domain names.  They are identifiers; their primary purpose is not to
be pretty, or to look exactly like names that appear elsewhere; their
primary purpose is to allow humans to easily and reliably refer to
internet hosts, mail domains, etc.

But I'm curious about this Swedish colon.  Does it have a position in
the alphabet for the purpose of alphabetizing names, or is it ignored,
or does it count as a word break?  In the case of O'Brien, a human
looking up the name in a list on paper would find it in the same place
regardless of whether they were looking for O'Brien or obrien.

In English alphabetization, letters and digits are significant, but no
other symbols are, and case is insignificant.  Word breaks are often
considered significant.  Host names allow case-insensitive letters and
digits, and just one more character for breaking words.  That's quite
a correlation.  Maybe this alphabetization test reveals what humans
consider essential versus decorative/annotative.
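
The alphabetization test can be approximated in code.  The following is a
rough sketch under my own assumptions: alpha_key is a hypothetical name,
space and hyphen are treated as the only word breaks, and real collation
rules (especially outside English) are more involved than this.

```python
def alpha_key(name):
    """Sort key: keep letters and digits (case-folded), keep word breaks,
    and drop every other symbol."""
    out = []
    for ch in name:
        if ch.isalnum():
            out.append(ch.casefold())
        elif ch in " -":
            out.append(" ")  # space and hyphen both count as a word break
        # any other character (apostrophe, ampersand, ...) is ignored
    return "".join(out)


# The apostrophe and the capital letters make no difference to the key.
same_place = alpha_key("O'Brien") == alpha_key("obrien")

ordered = sorted(["O'Connor", "obrien", "O'Brien"], key=alpha_key)
```

Under this key, O'Brien and obrien land in the same place in a sorted list,
and a hyphenated host name sorts as separate words.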

AMC