
[idn] Is space allowed in a hostname?



I have looked at the parsing of hostnames and domain names, and there are
some areas that I think may cause problems.

Today a hostname can be made up of A-Z, 0-9 and "-", and a domain name of
A-Z, 0-9, "-" and ".".
When we move to UCS we will have many more characters.

Looking at the handling of combining code points in UCS, Unicode does not
handle them in a way that will be easy for many programmers to deal with.
For example: SPACE, which is not allowed in an ASCII hostname and should
probably not be allowed in a UCS hostname, can easily be checked for and
parsed as a separator in ASCII. But in UCS it is possible to represent
spacing accents as SPACE + combining accent. This means that the UTF-8
form may contain a SPACE code point that does not represent the SPACE
character. That will make parsing much more difficult. Looking at the
Unicode normalisation forms and trying IBM's ICU, NFC does not normalise
SPACE + combining accent into the "spacing accent" code point. NFKC goes
the other way: instead of composing, it decomposes spacing accents into
SPACE + combining accent.
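
The behaviour can also be reproduced without ICU. The following is only a
small sketch using Python's standard unicodedata module (not the ICU test
I ran), with ACUTE ACCENT (U+00B4) picked as an example spacing accent.
The helper name codepoints() is made up for the illustration.

```python
import unicodedata

SPACING_ACUTE = "\u00b4"            # ACUTE ACCENT, a "spacing accent"
SPACE_PLUS_ACUTE = "\u0020\u0301"   # SPACE + COMBINING ACUTE ACCENT


def codepoints(s):
    """Render a string as a list of U+XXXX code points (illustrative helper)."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


# NFC leaves the two-code-point sequence alone: it is not composed
# into the single spacing accent code point.
nfc_of_sequence = unicodedata.normalize("NFC", SPACE_PLUS_ACUTE)

# NFKC takes the single spacing accent and decomposes it into
# SPACE + combining accent.
nfkc_of_spacing = unicodedata.normalize("NFKC", SPACING_ACUTE)

print(codepoints(nfc_of_sequence))
print(codepoints(nfkc_of_spacing))
```

Both lines print "U+0020 U+0301", so after either normalisation a parser
can still meet a SPACE code point that is not a separator.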

To make things a little easier for software handling hostnames we could
forbid all accents, or we could allow spacing accents but not
"SPACE + combining accent". In short: the SPACE code point is only allowed
when it is not followed by a combining character.
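
As a rough sketch of how such a check might look (an illustration only,
not a proposed implementation), here it is in Python. The function names
are hypothetical, and treating the general categories Mn, Mc and Me as
"combining character" is my assumption about what the rule should cover.

```python
import unicodedata


def is_combining(ch):
    """Assumption: 'combining character' means general category Mn, Mc or Me."""
    return unicodedata.category(ch) in ("Mn", "Mc", "Me")


def violates_space_rule(text):
    """Return True if any SPACE (U+0020) is directly followed by a combining mark."""
    return any(
        first == " " and is_combining(second)
        for first, second in zip(text, text[1:])
    )


# SPACE + COMBINING ACUTE ACCENT is rejected ...
print(violates_space_rule("a\u0020\u0301b"))
# ... while the single spacing accent code point and a plain SPACE pass.
print(violates_space_rule("a\u00b4b"))
print(violates_space_rule("a b"))
```

With this rule in force, every SPACE code point that survives the check is
a real SPACE, so software can go on treating it as a separator.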


While NFKC may be a good idea for matching names, it is not a good idea
as the normalised form of a name. NFKC removes duplicate forms of single
characters (like wide A and circled A), which is good. But it also
replaces single code points that represent several characters with those
several characters. In some cases that may be fine (it makes no semantic
difference), but in others the resulting name is not the same (for
example: superscript 2 (U+00B2) is replaced by the digit 2 (U+0032)).
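
These cases are easy to try out. A minimal sketch with Python's standard
unicodedata module follows; the labels are mine, and the "fi" ligature
(U+FB01) is my own added example of one code point that stands for
several characters.

```python
import unicodedata


def nfkc(s):
    """Apply Unicode normalisation form KC."""
    return unicodedata.normalize("NFKC", s)


examples = {
    "fullwidth A": "\uff21",    # wide A
    "circled A": "\u24b6",
    "ligature fi": "\ufb01",    # one code point standing for two letters
    "superscript 2": "\u00b2",
}

results = {label: nfkc(ch) for label, ch in examples.items()}

for label, folded in results.items():
    print(label, "->", " ".join(f"U+{ord(c):04X}" for c in folded))

# The semantic problem: "x squared" and "x2" become the same name.
print(nfkc("x\u00b2") == "x2")
```

The first two fold to a plain A (U+0041), the ligature becomes the two
letters f and i, and the superscript 2 becomes an ordinary digit, so the
last line prints True.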


From what I have found out, the best normalised form of a domain name is
NFC, with the alternative code points forbidden for those LETTERS that
have more than one code point, and with every code point sequence that
can be represented by a single code point combined into that code point.

(Note: the IDNA nameprepped name is a form used for domain name matching.
It is not the same as the normalised form above.)

  Dan