
Unicode categories, normalisation, for IDN



Hi!

	Despite some initial trouble getting on this list,
I've now been able to subscribe to it.  Thanks Martin!
I've tried to catch up on the e-mails so far, but I've
only browsed quickly through them.

	I've been thinking a bit about how domain names should
be internationalised, and the text below reflects my current
thinking about this.  Most of this text was written before
browsing through the e-mail archive for this list.  The
wording sometimes reads as if this were a standards document,
which it of course isn't.

	Note that I'm not a DNS expert, but I have some
knowledge about Unicode.

		Kind regards
		/Kent Karlsson


=========================================================
(Converted from a proprietary document format to plain text.
Not much touchup has been done after that.)


Domain name internationalisation

Draft 0.4

2000-01-26

Kent Karlsson, IMI—Industri-Matematik International
keka@im.se


1	Introduction

This note is about how Internet domain names should be internationalised.
It deals with the encoding and restrictions of domain names as sent to a DNS
(Domain Name Server).  Domain names can of course be stored differently
inside of documents (e.g. in XHTML documents, or e-mail messages).

At present Internet domain names are still restricted to 7-bit ASCII
(ISO/IEC 646) as sent to a DNS, with some additional rules on which such
characters are allowed.  HTML, XML, IMAP, FTP, and many other text based
items on the Internet have already been internationalised in the sense that
a much wider range of characters are allowed, in particular using the UTF-8
encoding of Unicode or ISO/IEC 10646-1.  It is high time for domain names to
be similarly internationalised.

That the domain name internationalisation effort should be based on
Unicode/UTF-8 is taken as a given, as no other candidate offers both global
viability and backwards compatibility with the existing DNS.


2	Unicode vs. ISO/IEC 10646

Unicode 3.0 and ISO/IEC 10646-1:2000 allocate the same characters at the
same (abstract) code positions.  They both define a UTF-8 encoding format,
with a slight difference (see below).  They also both define a UTF-16
format, but that format is not suitable for domain names as sent to a DNS
server, taking backwards compatibility into account.

Unicode (but not ISO/IEC 10646) assigns property codes to characters.  For
the purposes of this version of domain name internationalisation, both the
normative and informative general category property assignments of Unicode
3.0.0 are considered normative.


3	Unicode versioning

This version of domain name internationalisation is made with Unicode 3.0 as
a basis.  When new versions of Unicode are issued, one may need to
re-examine the domain name internationalisation.  Most likely, Unicode 3.0
will be sufficient for domain name use.


4	UTF-8 encoding

The Unicode UTF-8 format is limited to the first 17 planes, while the
ISO/IEC 10646 UTF-8 covers 32 768 planes.  For the purposes of this version
of domain name internationalisation, UTF-8 is limited to plane 0 (the Basic
Multilingual Plane) only.

The details of the UTF-8 encoding are not described here.  Please see
ISO/IEC 10646-1:2000, Annex D, or The Unicode Standard, version 3.0, annex
?, or RFC 2279 (which obsoletes RFC 2044).

UTF-8 is compatible with 7-bit ASCII, i.e. a 7-bit ASCII string where each
octet has the 8th bit set to 0 is in UTF-8 already.

4.1	Malformed UTF-8 encodings

Looked-up potential domain names that contain malformed UTF-8 sequences
shall be rejected by a DNS as unregistered or, optionally, as being in
error.
·	An octet with the value FE or FF is a malformed UTF-8 sequence.
·	An isolated continuation octet is a malformed UTF-8 sequence.
·	A prematurely terminated UTF-8 sequence is a malformed UTF-8
sequence.
·	An unnecessarily long (for the abstract code point encoded) UTF-8
sequence is a malformed UTF-8 sequence.
·	UTF-8 sequences for the (abstract) code points FFFE and FFFF are
malformed UTF-8 sequences.
·	A UTF-8 sequence longer than three octets is considered malformed
for the purposes of this version of domain name internationalisation.
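
The rules in the list above can be sketched as a small hand-written decoder.
This is only an illustration under the restrictions of this note (plane 0
only, at most three octets per character); the function names
decode_bmp_utf8 and is_well_formed are made up for the example, and the
checks for surrogates and private use characters from 4.2 and 4.3 are not
included here.

```python
def decode_bmp_utf8(octets: bytes) -> list:
    """Decode UTF-8 restricted to plane 0; raise ValueError if malformed."""
    code_points = []
    i = 0
    while i < len(octets):
        first = octets[i]
        if first <= 0x7F:        # one octet, plain 7-bit ASCII
            length, value, minimum = 1, first, 0x0000
        elif first <= 0xBF:      # 80..BF cannot start a sequence
            raise ValueError("isolated continuation octet")
        elif first <= 0xDF:      # C0..DF: two-octet sequence
            length, value, minimum = 2, first & 0x1F, 0x0080
        elif first <= 0xEF:      # E0..EF: three-octet sequence
            length, value, minimum = 3, first & 0x0F, 0x0800
        elif first <= 0xFD:      # F0..FD: four octets or more
            raise ValueError("sequence longer than three octets")
        else:                    # FE and FF never occur in UTF-8
            raise ValueError("octet FE or FF")
        tail = octets[i + 1:i + length]
        if len(tail) != length - 1 or any((b & 0xC0) != 0x80 for b in tail):
            raise ValueError("prematurely terminated sequence")
        for b in tail:
            value = (value << 6) | (b & 0x3F)
        if value < minimum:      # shorter encoding exists for this value
            raise ValueError("unnecessarily long sequence")
        if value in (0xFFFE, 0xFFFF):
            raise ValueError("code point FFFE or FFFF")
        code_points.append(value)
        i += length
    return code_points


def is_well_formed(octets: bytes) -> bool:
    """True if the octets pass all the checks in decode_bmp_utf8."""
    try:
        decode_bmp_utf8(octets)
    except ValueError:
        return False
    return True
```

A server following this sketch would treat any name for which
is_well_formed returns False as unregistered (or as an error).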

4.2	Surrogates

Surrogate character codes are reserved for use with UTF-16.  These are the
code points D800 – DFFF. A UTF-8 sequence for a surrogate character code is
a malformed UTF-8 sequence.

4.3	Private use characters

Unicode reserves some code points for private use characters.  In plane 0
(BMP) these are U+E000 – U+F8FF. These are intended for use only by user
agreement of some kind.

Private use characters are inappropriate for use in domain names.  A UTF-8
sequence for a private use character code is considered a malformed UTF-8
sequence for the purposes of this version of domain name
internationalisation.
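
As a rough sketch, the two exclusions in 4.2 and 4.3 amount to range checks
on the decoded code point.  The full surrogate range in Unicode is U+D800 to
U+DFFF (high surrogates D800 to DBFF, low surrogates DC00 to DFFF).  The
name is_excluded_code_point is invented for this example.

```python
SURROGATES = range(0xD800, 0xE000)        # U+D800 .. U+DFFF
PRIVATE_USE_BMP = range(0xE000, 0xF900)   # U+E000 .. U+F8FF


def is_excluded_code_point(cp: int) -> bool:
    """True for code points this note treats as malformed in a domain name."""
    return cp in SURROGATES or cp in PRIVATE_USE_BMP
```

The same result could presumably be obtained from the general categories Cs
and Co discussed in clause 5.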


5	Unicode general categories

Unicode assigns general categories (as well as other character properties)
to characters.  The Unicode 3.0 general categories and their interpretation
for domain names are discussed in the following sections.

Unicode regards some of these properties as normative, some as informative.
For this version of internationalised domain names, all of them are
considered normative.

5.1	Letters, ideographs, and syllable characters

Lu	Letter, Uppercase 	Ok for domain names
Ll	Letter, Lowercase 	Ok for domain names
Lt	Letter, Titlecase 	Ok for domain names
Lm	Letter, Modifier 	Ok for domain names
Lo	Letter, Other 	Ok for domain names

All of the letters, ideographs, and syllable characters of Unicode 3.0 are
appropriate for use in domain names.  Note however that a difference in
letter characters need not imply a difference in domain name.  Canonical,
compatibility, and case distinctions are to be ignored.  Case distinctions
have been ignored in domain names since the beginning.  Since case is
ignored, so should the less important compatibility distinctions be.  See
also clause 6 below about normalisation.

5.2	Combining marks

Mn 	Mark, Non-Spacing 	Must not be first, nor after a FULL STOP
(not the LEFT/RIGHT half ones)
Mc	Mark, Spacing Combining	Must not be first, nor after a FULL STOP
Me	Mark, Enclosing 	Probably inappropriate for domain names

Used with reason and in moderation, combining marks are ok for use with
domain names.  Note however that character sequence distinctions that are
equivalenced by Unicode canonical equivalence do not imply a difference in
domain name.  See also the clause about normalisation below.

There are a number of script specific rules on how combining characters
should be applied.  For the purposes of domain names, we note that they are
not to come first in any (FULL STOP separated) part of a domain name.   See
also clause 6 below about normalisation, and clause 7 below about scripts.
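
The positional rule for combining marks can be sketched with Python's
unicodedata module.  Note that this module reflects whatever Unicode version
ships with the interpreter, not necessarily Unicode 3.0, and that the
function name labels_start_ok is made up for the example.

```python
import unicodedata


def labels_start_ok(name: str) -> bool:
    """False if any FULL STOP separated part begins with a combining mark."""
    for label in name.split("."):
        # General categories Mn, Mc and Me all start with "M".
        if label and unicodedata.category(label[0]).startswith("M"):
            return False
    return True
```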

5.3	Numbers

Nd	Number, Decimal Digit 	Ok for domain names
Nl	Number, Letter 	Ok for domain names
No	Number, Other 	Inappropriate for domain names? (comp. decomp.)

Many “number” characters are ok for use with domain names.  Note however
that many number characters have compatibility decompositions into
letters, ideographs, or other number characters, and so are equivalent in a
domain name.  [The “No” characters that do not have a decomposition??]
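
To illustrate, the following sketch looks up the category, the presence of
a decomposition, and the KC-normalised form of a few number characters.  It
relies on Python's unicodedata module, whose data may be from a later
Unicode version than 3.0; the helper name describe is invented here.

```python
import unicodedata


def describe(ch: str) -> tuple:
    """Return (general category, has a decomposition?, NFKC form)."""
    return (
        unicodedata.category(ch),
        bool(unicodedata.decomposition(ch)),
        unicodedata.normalize("NFKC", ch),
    )


# SUPERSCRIPT TWO, CIRCLED DIGIT ONE, ROMAN NUMERAL ONE, TAMIL NUMBER TEN
for ch in ("\u00B2", "\u2460", "\u2160", "\u0BF0"):
    print(unicodedata.name(ch), describe(ch))
```

TAMIL NUMBER TEN is an example of a "No" character without any
decomposition, which is the open case noted above.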

5.4	Punctuation

Pc	Punctuation, Connector 	Inappropriate for domain names (possibly
with some exceptions, like KATAKANA MIDDLE DOT)
Pd	Punctuation, Dash 	Inappropriate for domain names, except for a
few characters (see below).
Ps	Punctuation, Open 	Inappropriate for domain names
Pe	Punctuation, Close 	Inappropriate for domain names
Pi	Punctuation, Initial quote	Inappropriate for domain names
Pf	Punctuation, Final quote	Inappropriate for domain names
Po	Punctuation, Other	Inappropriate for domain names, except for a
few characters (see below).

Domain name rules have always excluded punctuation characters, except for
FULL STOP, which is given special significance within domain names.  MIDDLE
DOT and HYPHEN (or HYPHEN-MINUS) may need to be considered to be allowed.

Punctuation has been excluded from domain names proper, since some (not all)
punctuation characters in 7-bit ASCII have been used for other purposes near
domain names.  E.g. @, !, /, :, and % have special meanings near domain
names in many contexts.  Other punctuation is reserved for present or
possible future use near domain names.

BiDi and FULL STOPs (and @s)??

5.5	Symbols

Sm	Symbol, Math 	Inappropriate for domain names
Sc	Symbol, Currency 	Inappropriate for domain names
Sk	Symbol, Modifier 	Inappropriate for domain names?
So	Symbol, Other 	Inappropriate for domain names (comp. decomp.?)

As is the case for punctuation, symbols are inappropriate for use with domain
names.

5.6	Separators

Zs	Separator, Space 	Inappropriate for domain names
Zl	Separator, Line 	Inappropriate for domain names
Zp	Separator, Paragraph 	Inappropriate for domain names

Spaces and similar separators (like LINE FEED) have always been considered
inappropriate for use in domain names.  Unicode has many more different
space characters than ASCII, and it also has new line/paragraph separation
characters.

5.7	Other characters

Cc	Other, Control 	Inappropriate for domain names
Cf	Other, Format 	Inappropriate for domain names (mostly??)
Cs	Other, Surrogate 	Inappropriate for domain names
Co	Other, Private Use 	Inappropriate for domain names
Cn	Other, Not Assigned	Inappropriate for domain names in this
version

Control, format, surrogate, and private use characters are inappropriate for
use in domain names.  For this version of internationalised domain names,
(abstract) code points that were unassigned in Unicode 3.0 are
inappropriate.

Note that the class Cf includes ZERO WIDTH NO-BREAK SPACE, which can be used
as a “signature” when at the beginning of a string.  This use is also
inappropriate for domain names.
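
The category tables in 5.1 to 5.7 could be turned into a simple filter along
the following lines.  This is a sketch only: it takes the undecided entries
(Me, No, and the punctuation exceptions) in their most restrictive reading
and allows only HYPHEN-MINUS and the FULL STOP separator as extras.  Python's
unicodedata module is used for the categories, so the data may differ from
Unicode 3.0.  ALLOWED_CATEGORIES, EXTRA_ALLOWED and disallowed_characters are
names made up for the example.

```python
import unicodedata

# Categories marked "Ok" (or positionally restricted) in the tables above.
ALLOWED_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Mn", "Mc", "Nd", "Nl"}

# Assumption: only these two punctuation characters are let through.
EXTRA_ALLOWED = {"-", "."}


def disallowed_characters(name: str) -> list:
    """Return the characters of name that the tables would reject."""
    return [
        ch
        for ch in name
        if ch not in EXTRA_ALLOWED
        and unicodedata.category(ch) not in ALLOWED_CATEGORIES
    ]
```

An empty result means that every character passed the category test.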

5.8	The Plane 14 suggestion

The “language tag” characters, that are suggested to be allocated in plane
14, see Unicode technical report number 7, are inappropriate for use in
domain names.

5.9	ISO/IEC TR 10176 AMD 1

The technical report ISO/IEC TR 10176 (Guidelines for the preparation of
programming language standards) in its revised (soon to be AMD 1) annex
lists characters that at a minimum should be accepted in programming
language identifiers.  It does so for a “level 2 implementation” of ISO/IEC
10646.  A domain name is similar to an “identifier” in a programming
language, so what 10176 lists in its (revised!) Annex A should at least be
considered. 

See PDAM text at http://std.dkuug.dk/jtc1/sc22/wg20/docs/n699.pdf.
Note that this TR (as amended in what will be AMD 1) is based on Unicode
2.1, not Unicode 3.0.  An AMD 2, etc., is promised to only extend what is in
AMD 1.  Note also that compatibility forms are excluded from the lists in
AMD 1, but programming languages may of course allow both compatibility
forms and “level 2” combining marks.  Nothing is said in AMD 1 about
normalisation.

ISO/IEC TR 10176 PDAM 1 is supported by the Unicode consortium, and is their
(and SC22/WG20's) correction to the original list.  The original list should
be considered defective.


6	Normalisation for domain names

6.1	Case normalisation

Internet domain names have been case insensitive from the start.  When
extending the allowed characters in domain names, it would be unwise to
either abandon case insensitiveness or restrict it to just the ASCII part.
Instead, this principle should be extended to the new characters allowed in
domain names.  However, there are some problems with this.  First, the case
mappings documented by the Unicode consortium are only informative, not
normative.  Second, there are some known exceptions, such as that for Turkish
i and dotless i.  Third, in several more cases the case mapping is not 1 to
1, e.g. sharp s (ß; U+00DF) maps to uppercase SS, and mapping that back to
lowercase gives ss.  There are several other such cases. [not sure exactly
what to do with these]

Unicode Technical Report number 21 [UTR21] describes one way of doing this
[is that appropriate? Any better way of doing this?] SHARP S, YPOGEGRAMMENI,
PROSGEGRAMMENI?  Map to lowercase? Map to uppercase? tolower(toupper(x))?
UTR 21 (with the associated data file CaseFolding.txt) essentially
(exactly?) implies tolower(toupper(x)) (see also below); dotless i might not
be handled the way desired (in Turkey), nor are sigma and other letters with
final forms.
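
The tolower(toupper(x)) idea can be tried out with Python's built-in string
methods, which apply the language-independent Unicode case mappings of the
interpreter's Unicode version (an assumption worth keeping in mind, since
those are not the Unicode 3.0 tables).  The name lower_of_upper is made up
for this sketch.

```python
def lower_of_upper(s: str) -> str:
    """Approximate tolower(toupper(x)) with Python's default case mappings."""
    return s.upper().lower()


print(lower_of_upper("\u00DF"))   # sharp s
print(lower_of_upper("\u0131"))   # dotless i
```

With these mappings sharp s comes out as ss, and dotless i comes out as an
ordinary i, which is exactly the Turkish problem mentioned above.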

6.2	Unicode normalisation

Canonical distinctions, in the Unicode sense, shall be ignored.
Since case distinctions should be ignored, compatibility distinctions should
most certainly be ignored too.  Compatibility distinctions can be normalised
away with the same algorithm as canonical distinctions are normalised away.
Normalisation form KC (compatibility decomposition, logically followed by
canonical composition), see Unicode Technical Report number 15 [UTR15],
should be used for domain names, at least at registration time, if not at
lookup time.  Among a few other things, this maps WIDE, NARROW, and
PRESENTATION FORM characters to their nominal corresponding character.
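
A few of these mappings can be seen with unicodedata.normalize from the
Python standard library.  The wrapper name normalise_kc is invented for the
example, and the results reflect the interpreter's Unicode data.

```python
import unicodedata


def normalise_kc(name: str) -> str:
    """Apply normalisation form KC (NFKC) to a candidate domain name."""
    return unicodedata.normalize("NFKC", name)


samples = {
    "fullwidth ex": "\uFF45\uFF58",          # FULLWIDTH LATIN SMALL LETTER E, X
    "halfwidth ka": "\uFF76",                # HALFWIDTH KATAKANA LETTER KA
    "alef isolated form": "\uFE8D",          # ARABIC LETTER ALEF ISOLATED FORM
    "e + combining acute": "e\u0301",        # canonical composition
}
for label, text in samples.items():
    print(label, [hex(ord(c)) for c in normalise_kc(text)])
```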

The category tests above apply to the character string that results from
KC normalisation.

Normalisation KC by itself does not imply any case normalisation.
Note that normalise(KC, casefold(x)) is not the same as
casefold(normalise(KC, x)), if casefold follows CaseFolding.txt.
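
One character that shows the difference is DOUBLE-STRUCK CAPITAL C (U+2102),
which has a compatibility decomposition to C but no case mapping of its own.
The sketch below uses str.casefold and unicodedata.normalize from Python,
which follow the Unicode version bundled with the interpreter rather than
Unicode 3.0; kc is a helper name made up here.

```python
import unicodedata


def kc(s: str) -> str:
    """Normalisation form KC."""
    return unicodedata.normalize("NFKC", s)


x = "\u2102"                      # DOUBLE-STRUCK CAPITAL C
fold_then_kc = kc(x.casefold())   # folding leaves U+2102 alone, KC gives "C"
kc_then_fold = kc(x).casefold()   # KC gives "C", folding then gives "c"
print(fold_then_kc, kc_then_fold, fold_then_kc == kc_then_fold)
```

So if both steps are wanted, the order (or a repeated application) has to be
specified.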


6.3	Further normalisation

FINAL SIGMA, FINAL KAF, FINAL MEM, FINAL NUN, FINAL PE, FINAL TSADI, FINAL
SEMKATH, BOPOMOFO FINAL *? Suggestion: ignore ‘finality’, i.e., consider
them to be equivalent to their corresponding ‘ordinary’ version.
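
Ignoring finality could be done with a small mapping table.  The sketch
below covers only the Greek and Hebrew letters from the list above (the
Syriac and Bopomofo cases are left out), and the names FINAL_TO_ORDINARY and
ignore_finality are made up for the example.

```python
FINAL_TO_ORDINARY = {
    "\u03C2": "\u03C3",   # GREEK SMALL LETTER FINAL SIGMA -> SIGMA
    "\u05DA": "\u05DB",   # HEBREW LETTER FINAL KAF -> KAF
    "\u05DD": "\u05DE",   # HEBREW LETTER FINAL MEM -> MEM
    "\u05DF": "\u05E0",   # HEBREW LETTER FINAL NUN -> NUN
    "\u05E3": "\u05E4",   # HEBREW LETTER FINAL PE -> PE
    "\u05E5": "\u05E6",   # HEBREW LETTER FINAL TSADI -> TSADI
}

_TABLE = str.maketrans(FINAL_TO_ORDINARY)


def ignore_finality(s: str) -> str:
    """Replace each listed final form with its ordinary counterpart."""
    return s.translate(_TABLE)
```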

[Funny, CaseFolding.txt maps all sigmas to final(!) sigmas; but does nothing
for other ‘final’ characters.]

Map HYPHEN, NO-BREAK HYPHEN, and * DASHes to HYPHEN-MINUS? Remove * SOFT
HYPHEN and ZWSP?
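
If the answer to both questions is yes, the mapping might look like the
sketch below.  Which characters count as "* DASH" is an assumption here: any
character whose Unicode name ends in " DASH", plus HYPHEN (U+2010) and the
character formally named NON-BREAKING HYPHEN (U+2011).  The function name
normalise_hyphens is invented for the example.

```python
import unicodedata

REMOVED = {"\u00AD", "\u200B"}      # SOFT HYPHEN, ZERO WIDTH SPACE
HYPHENS = {"\u2010", "\u2011"}      # HYPHEN, NON-BREAKING HYPHEN


def normalise_hyphens(s: str) -> str:
    """Drop soft hyphens and ZWSP; map hyphens and dashes to HYPHEN-MINUS."""
    out = []
    for ch in s:
        if ch in REMOVED:
            continue
        # unicodedata.name returns the default "" for unnamed characters.
        if ch in HYPHENS or unicodedata.name(ch, "").endswith(" DASH"):
            out.append("-")
        else:
            out.append(ch)
    return "".join(out)
```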

“New line function” ‘normalisation’ (see UTR 13) does not apply to domain
names, since no domain name is to have any such character in it.


6.4	A possible alternative to normalisation: collation weighting 

A possible alternative to doing KC and case normalisation is to use the ISO/IEC
14651 CTT (common template table), or the UTR 10 associated tables, with
some tailoring suitable for the DNS (no, NOT local ones).  In particular,
punctuation and symbols must be significant at level 1.  Then determine
equality up to and including level 2 (accents; similar), but not level 3
(case; hira/kata, various compatibility distinctions).
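
The comparison "equal up to and including level 2" can be illustrated with a
toy weight table.  The weights below are entirely hypothetical and are NOT
taken from the 14651 CTT or the UTR 10 tables; they only show the mechanism,
and the names TOY_WEIGHTS, key_to_level2 and same_name are made up.

```python
# Hypothetical (level 1, level 2, level 3) weights for a handful of characters.
TOY_WEIGHTS = {
    "-": (5, 1, 1),                        # punctuation significant at level 1
    "a": (10, 1, 1), "A": (10, 1, 2),      # case differs only at level 3
    "\u00E1": (10, 2, 1), "\u00C1": (10, 2, 2),   # accent differs at level 2
    "b": (11, 1, 1), "B": (11, 1, 2),
}


def key_to_level2(name: str) -> tuple:
    """Build a comparison key from level 1 and level 2 weights only."""
    weights = [TOY_WEIGHTS[ch] for ch in name]
    return (tuple(w[0] for w in weights), tuple(w[1] for w in weights))


def same_name(x: str, y: str) -> bool:
    """Equality that ignores level 3 (case and similar) differences."""
    return key_to_level2(x) == key_to_level2(y)
```

With this table "ab" and "AB" compare equal, while "ab" and "áb" do not.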

This is also based on Unicode 2.1, not yet Unicode 3.0.  Also, there is at
present NO promise not to make changes that may affect, to some degree, use of
the weightings that result.  In particular, for 14651 no particular weight
VALUES are assigned.  That is up to each implementation.  For the UTR 10
tables, the actual weight values may change at any update (or in any
suitable way by tailoring, or other implementation decisions), so different
versions cannot be used in a mix.  Finally, there is no resulting “normal
form” character string from these weight tables.


7	One should not mix scripts between FULL STOPs

It is not a good idea to mix scripts freely in a single “part” of a domain
name.  E.g., it would be very confusing if an initial A is a Greek A, while
the rest of the name part is in the Latin script.

However, what constitutes a script is not clearly defined, and some
orthographies (like the Japanese) normally do mix “scripts” in a single
“word”.  Therefore this must be left for human judgement.  For an automated
service one may apply some heuristic on suggested names that may need human
scrutiny, or reject doubtful cases for registration.  Note also that ASCII
digits can be used with any other script, and many of the combining
non-spacing marks are script generic, i.e. can be used with several
different scripts.

No rigid scheme should be applied for this.  It should only be a
registration time heuristic, overrideable by human intervention.
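
One very crude registration time heuristic is sketched below.  Python's
standard library has no script property, so the first word of the Unicode
character name (LATIN, GREEK, CYRILLIC, ...) is used as a stand-in for the
script; that is an assumption of this example and not a real script
definition.  ASCII-style decimal digits, HYPHEN-MINUS and combining marks are
skipped as script generic.  The function names are made up.

```python
import unicodedata


def scripts_in_label(label: str) -> set:
    """Guess the scripts in one name part from the character names."""
    scripts = set()
    for ch in label:
        category = unicodedata.category(ch)
        if category.startswith("M") or category == "Nd" or ch == "-":
            continue   # treated as usable with any script
        scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def needs_human_review(name: str) -> bool:
    """Flag names where some FULL STOP separated part mixes scripts."""
    return any(len(scripts_in_label(part)) > 1 for part in name.split("."))
```

A name part that mixes kanji and hiragana is flagged too, which is why the
result should only trigger human scrutiny and not automatic rejection.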


8	&-encoding (XML), %-encoding (URL), and =-encoding (QP)

Any &-encoding used in XML (or HTML) documents in a string that contains a
domain name shall be decoded before sending the domain name to a DNS system.
Note that XML &-codes are character oriented and independent of the
character encoding used for the XML document itself.

Any %-encoding in a URL shall not be decoded in the domain name part, and %
as such is not legal in a domain name.  Such a domain name is thus
malformed.  The % character may mean something else though, so no attempt at
URL %-decoding shall be done at that point.  In addition, the octet oriented
(not character oriented) %-encoding is for an unknown character encoding,
and any attempt at decoding it by the client is likely to be in error.

Any =-encoding in an e-mail in Quoted-Printable shall be decoded according
to the charset declaration of the message.  Hopefully, Quoted-Printable will
go out of use, so this should be less of a problem...
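
The three cases can be sketched with Python standard library calls:
html.unescape for &-codes, a plain check for % in the host part, and
quopri.decodestring for Quoted-Printable.  The wrapper names and the sample
host bücher.example are made up for the illustration.

```python
import html
import quopri


def host_from_markup(text: str) -> str:
    """Decode XML/HTML character references before the name is looked up."""
    return html.unescape(text)


def check_url_host(host: str) -> str:
    """Reject a URL host containing %, without attempting any %-decoding."""
    if "%" in host:
        raise ValueError("% is not legal in a domain name")
    return host


def host_from_quoted_printable(raw: bytes, charset: str) -> str:
    """Decode =-encoding, then interpret the octets in the declared charset."""
    return quopri.decodestring(raw).decode(charset)


print(host_from_markup("b&#xFC;cher.example"))
print(host_from_quoted_printable(b"b=FCcher.example", "iso-8859-1"))
```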


9	E-mail address internationalisation

The pre-@ part of e-mail addresses should be internationalised in the same
way as domain names are internationalised.

====================================================================