
Re: Unicode categories, normalisation, for IDN



Good issues on using Unicode on IDN.

Most of the issues raised, such as numbers and symbols, are important, and I
will incorporate them into the next doc. However, let's focus on the
requirements doc and not diverge into implementation yet. Hopefully, we will
have at least version 0 of the requirements draft before the next IETF meeting
in March.

Kent, please refer to ftp://ops.ietf.org/pub/lists/idn* for the archives of the
discussions.

Thanks!

-James Seng

Karlsson Kent - keka wrote:
> 
> Hi!
> 
>         Despite some initial trouble getting on this list,
> I've now been able to subscribe to it.  Thanks Martin!
> I've tried to catch up on the e-mails so far, but I've
> only browsed quickly through them.
> 
>         I've been thinking a bit about how domain names should
> be internationalised, and the text below reflects my current
> thinking about this.  Most of this text was written before
> browsing through the e-mail archive for this list.  The
> formulations are sometimes as if it is a standards document,
> which it of course isn't.
> 
>         Note that I'm not a DNS expert, but I have some
> knowledge about Unicode.
> 
>                 Kind regards
>                 /Kent Karlsson
> 
> =========================================================
> (Converted from a proprietary document format to plain text.
> Not much touchup has been done after that.)
> 
> Domain name internationalisation
> 
> Draft 0.4
> 
> 2000-01-26
> 
> Kent Karlsson, IMI—Industri-Matematik International
> keka@im.se
> 
> 1       Introduction
> 
> This note is about how Internet domain names should be internationalised.
> It deals with the encoding and restrictions of domain names as sent to a DNS
> (Domain Name Server).  Domain names can of course be stored differently
> inside of documents (e.g. in XHTML documents, or e-mail messages).
> 
> At present Internet domain names are still restricted to 7-bit ASCII
> (ISO/IEC 646) as sent to a DNS, with some additional rules on which such
> characters are allowed.  HTML, XML, IMAP, FTP, and many other text based
> items on the Internet have already been internationalised in the sense that
> a much wider range of characters are allowed, in particular using the UTF-8
> encoding of Unicode or ISO/IEC 10646-1.  It is high time for domain names to
> be similarly internationalised.
> 
> That the domain name internationalisation effort should be based on
> Unicode/UTF-8 is taken as a given, as no other contender offers both global
> viability and backwards compatibility with the existing DNS system.
> 
> 2       Unicode vs. ISO/IEC 10646
> 
> Unicode 3.0 and ISO/IEC 10646-1:2000 allocate the same characters at the
> same (abstract) code positions.  They both define a UTF-8 encoding format,
> with a slight difference (see below).  They also both define a UTF-16
> format, but that format is not suitable for domain names as sent to a DNS
> server, taking backwards compatibility into account.
> 
> Unicode (but not ISO/IEC 10646) assigns property codes to characters.  For
> the purposes of this version of domain name internationalisation, both the
> normative and informative general category property assignments of Unicode
> 3.0.0 are considered normative.
> 
> 3       Unicode versioning
> 
> This version of domain name internationalisation is made with Unicode 3.0 as
> a basis.  When new versions of Unicode are issued, one may need to
> re-examine the domain name internationalisation.  Most likely, Unicode 3.0
> will be sufficient for domain name use.
> 
> 4       UTF-8 encoding
> 
> The Unicode UTF-8 format is limited to the first 17 planes, while the
> ISO/IEC 10646 UTF-8 covers 32 768 planes.  For the purposes of this version
> of domain name internationalisation, UTF-8 is limited to plane 0 (the Basic
> Multilingual Plane) only.
> 
> The details of the UTF-8 encoding are not described here.  Please see
> ISO/IEC 10646-1:2000, Annex D, or The Unicode Standard, version 3.0, annex
> ?, or RFC 2279 (which obsoletes RFC 2044).
> 
> UTF-8 is compatible with 7-bit ASCII, i.e. a 7-bit ASCII string where each
> octet has the 8th bit set to 0 is in UTF-8 already.
> 
> 4.1     Malformed UTF-8 encodings
> 
> Looked-up potential domain names that contain malformed UTF-8 sequences
> shall be rejected by a DNS as unregistered or, optionally, as being in
> error.
> ·      An octet with the value FE or FF is a malformed UTF-8 sequence.
> ·      An isolated continuation octet is a malformed UTF-8 sequence.
> ·      A prematurely terminated UTF-8 sequence is a malformed UTF-8
> sequence.
> ·      An unnecessarily long (for the abstract code point encoded) UTF-8
> sequence is a malformed UTF-8 sequence.
> ·      A UTF-8 sequence for either of the (abstract) code points FFFE and
> FFFF is a malformed UTF-8 sequence.
> ·      A UTF-8 sequence longer than three octets is considered malformed
> for the purposes of this version of domain name internationalisation.
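A minimal sketch of the 4.1 checks in Python, for illustration only. The function names decode_bmp_utf8 and is_well_formed are made up for this example and are not part of the draft. The sketch covers only the six bullets above.

```python
def decode_bmp_utf8(octets: bytes) -> list:
    """Decode UTF-8 limited to plane 0, raising ValueError if malformed."""
    code_points = []
    i = 0
    while i < len(octets):
        lead = octets[i]
        if lead < 0x80:        # 0xxxxxxx: 7-bit ASCII, one octet
            length, cp, minimum = 1, lead, 0x0000
        elif lead < 0xC0:      # 10xxxxxx with no lead octet before it
            raise ValueError("isolated continuation octet")
        elif lead < 0xE0:      # 110xxxxx: two-octet sequence
            length, cp, minimum = 2, lead & 0x1F, 0x0080
        elif lead < 0xF0:      # 1110xxxx: three-octet sequence
            length, cp, minimum = 3, lead & 0x0F, 0x0800
        elif lead >= 0xFE:     # FE and FF never occur in UTF-8
            raise ValueError("octet FE or FF")
        else:                  # F0..FD: four or more octets, beyond plane 0
            raise ValueError("sequence longer than three octets")
        tail = octets[i + 1:i + length]
        if len(tail) != length - 1 or any((o & 0xC0) != 0x80 for o in tail):
            raise ValueError("prematurely terminated sequence")
        for o in tail:
            cp = (cp << 6) | (o & 0x3F)
        if cp < minimum:       # a shorter sequence could encode this value
            raise ValueError("unnecessarily long sequence")
        if cp in (0xFFFE, 0xFFFF):
            raise ValueError("code point FFFE or FFFF")
        code_points.append(cp)
        i += length
    return code_points


def is_well_formed(octets: bytes) -> bool:
    """True if the octets pass every check in decode_bmp_utf8."""
    try:
        decode_bmp_utf8(octets)
        return True
    except ValueError:
        return False
```

A lookup front end could call is_well_formed() on each received name and answer "unregistered" (or an error) when it returns False.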
> 
> 4.2     Surrogates
> 
> Surrogate character codes are reserved for use with UTF-16.  These are the
> code points D800 – DFFF. A UTF-8 sequence for a surrogate character code is
> a malformed UTF-8 sequence.
> 
> 4.3     Private use characters
> 
> Unicode reserves some code points for private use characters.  In plane 0
> (BMP) these are U+E000 – U+F8FF. These are intended for use only by user
> agreement of some kind.
> 
> Private use characters are inappropriate for use in domain names.  A UTF-8
> sequence for a private use character code is considered a malformed UTF-8
> sequence for the purposes of this version of domain name
> internationalisation.
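As a sketch of 4.2 and 4.3, the two excluded ranges can be tested on the decoded code point. The surrogate block is taken here as U+D800 to U+DFFF (high and low halves together); the name excluded_reason is hypothetical.

```python
SURROGATES = range(0xD800, 0xE000)        # U+D800..U+DFFF, reserved for UTF-16
PRIVATE_USE_BMP = range(0xE000, 0xF900)   # U+E000..U+F8FF, plane 0 private use


def excluded_reason(code_point: int):
    """Return why a plane 0 code point is excluded, or None if neither rule applies."""
    if code_point in SURROGATES:
        return "surrogate"
    if code_point in PRIVATE_USE_BMP:
        return "private use"
    return None
```

A decoder would treat any non-None result as a malformed sequence, in the same way as the cases listed in 4.1.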
> 
> 5       Unicode general categories
> 
> Unicode assigns general categories (as well as other character properties)
> to characters.  The Unicode 3.0 general categories and their interpretation
> for domain names are discussed in the following sections.
> 
> Unicode regards some of these properties as normative, some as informative.
> For this version of internationalised domain names, all of them are
> considered normative.
> 
> 5.1     Letters, ideographs, and syllable characters
> 
> Lu      Letter, Uppercase       Ok for domain names
> Ll      Letter, Lowercase       Ok for domain names
> Lt      Letter, Titlecase       Ok for domain names
> Lm      Letter, Modifier        Ok for domain names
> Lo      Letter, Other   Ok for domain names
> 
> All of the letters, ideographs, and syllable characters of Unicode 3.0 are
> appropriate for use in domain names.  Note however that a difference in
> letter characters need not imply a difference in domain name.  Canonical,
> compatibility, and case distinctions are to be ignored.  Case distinctions
> have been ignored in domain names from the beginning.  Since case is ignored,
> so should the less important compatibility distinctions.  See also clause 6
> below about normalisation.
> 
> 5.2     Combining marks
> 
> Mn      Mark, Non-Spacing       Must not be first, nor after a FULL STOP
> (not the LEFT/RIGHT half ones)
> Mc      Mark, Spacing Combining Must not be first, nor after a FULL STOP
> Me      Mark, Enclosing         Probably inappropriate for domain names
> 
> Used with reason and in moderation, combining marks are ok for use with
> domain names.  Note however that character sequence distinctions that are
> equivalenced by Unicode canonical equivalence do not imply a difference in
> domain name.  See also the clause about normalisation below.
> 
> There are a number of script specific rules on how combining characters
> should be applied.  For the purposes of domain names, we note that they are
> not to come first in any (FULL STOP separated) part of a domain name.   See
> also clause 6 below about normalisation, and clause 7 below about scripts.
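The "not first in any part" rule can be sketched with Python's unicodedata module. Note that the module carries the category data of whatever Unicode version ships with the interpreter, not Unicode 3.0, and that the function names are hypothetical.

```python
import unicodedata


def label_starts_with_mark(label: str) -> bool:
    """True if the first character of a label has a general category M* (Mn, Mc, Me)."""
    return bool(label) and unicodedata.category(label[0]).startswith("M")


def has_leading_mark(domain: str) -> bool:
    """True if any FULL STOP separated part of the name begins with a combining mark."""
    return any(label_starts_with_mark(label) for label in domain.split("."))
```

For example, has_leading_mark("e\u0301cole.example") is False, while a name whose first label starts with U+0301 itself would be flagged.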
> 
> 5.3     Numbers
> 
> Nd      Number, Decimal Digit   Ok for domain names
> Nl      Number, Letter  Ok for domain names
> No      Number, Other   Inappropriate for domain names? (comp. decomp.)
> 
> Many “number” characters are ok for use with domain names.  Note however
> that many number characters have compatibility decomposition into
> letters, ideographs, or other number characters, and so are equivalent in a
> domain name.  [The “No” characters that do not have a decomposition??]
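To illustrate the point about compatibility decompositions of number characters, here is a small Python sketch using the standard unicodedata module (current Unicode data, not 3.0). The helper name describe is made up for the example.

```python
import unicodedata


def describe(ch: str):
    """Return (general category, NFKC form) for a single character."""
    return unicodedata.category(ch), unicodedata.normalize("NFKC", ch)


# A few number characters and what compatibility normalisation does to them.
SAMPLES = {
    "\u00b2": "SUPERSCRIPT TWO",
    "\u2460": "CIRCLED DIGIT ONE",
    "\u2163": "ROMAN NUMERAL FOUR",
    "\u00bd": "VULGAR FRACTION ONE HALF",
    "\u0bf0": "TAMIL NUMBER TEN",
}

if __name__ == "__main__":
    for ch, name in SAMPLES.items():
        category, folded = describe(ch)
        print(f"U+{ord(ch):04X} {name}: {category} -> {folded!r}")
```

SUPERSCRIPT TWO and CIRCLED DIGIT ONE fold to ASCII digits and ROMAN NUMERAL FOUR folds to the letters "IV". VULGAR FRACTION ONE HALF folds to a sequence containing FRACTION SLASH, which is itself a symbol. TAMIL NUMBER TEN is an example of a "No" character with no decomposition at all.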
> 
> 5.4     Punctuation
> 
> Pc      Punctuation, Connector  Inappropriate for domain names (possibly
> with some exceptions, like KATAKANA MIDDLE DOT)
> Pd      Punctuation, Dash       Inappropriate for domain names, except for a
> few characters (see below).
> Ps      Punctuation, Open       Inappropriate for domain names
> Pe      Punctuation, Close      Inappropriate for domain names
> Pi      Punctuation, Initial quote      Inappropriate for domain names
> Pf      Punctuation, Final quote        Inappropriate for domain names
> Po      Punctuation, Other      Inappropriate for domain names, except for a
> few characters (see below).
> 
> Domain name rules have always excluded punctuation characters, except for
> FULL STOP, which is given special significance within domain names.  MIDDLE
> DOT and HYPHEN (or HYPHEN-MINUS) may need to be considered to be allowed.
> 
> Punctuation has been excluded from domain names proper, since some (not all)
> punctuation characters in 7-bit ASCII have been used for other purposes near
> domain names.  E.g. @, !, /, :, and % have special meanings near domain
> names in many contexts.  Other punctuation is reserved for present or
> possible future use near domain names.
> 
> BiDi and FULL STOPs (and @s)??
> 
> 5.5     Symbols
> 
> Sm      Symbol, Math    Inappropriate for domain names
> Sc      Symbol, Currency        Inappropriate for domain names
> Sk      Symbol, Modifier        Inappropriate for domain names?
> So      Symbol, Other   Inappropriate for domain names (comp. decomp.?)
> 
> As is the case for punctuation, symbols are inappropriate for use with domain
> names.
> 
> 5.6     Separators
> 
> Zs      Separator, Space        Inappropriate for domain names
> Zl      Separator, Line         Inappropriate for domain names
> Zp      Separator, Paragraph    Inappropriate for domain names
> 
> Spaces and similar separators (like LINE FEED) have always been considered
> inappropriate for use in domain names.  Unicode has many more different
> space characters than ASCII, and it also has new line/paragraph separation
> characters.
> 
> 5.7     Other characters
> 
> Cc      Other, Control  Inappropriate for domain names
> Cf      Other, Format   Inappropriate for domain names (mostly??)
> Cs      Other, Surrogate        Inappropriate for domain names
> Co      Other, Private Use      Inappropriate for domain names
> Cn      Other, Not Assigned     Inappropriate for domain names in this
> version
> 
> Control, format, surrogate, and private use characters are inappropriate for
> use in domain names.  For this version of internationalised domain names,
> (abstract) code points that were unassigned in Unicode 3.0 are
> inappropriate.
> 
> Note that the class Cf includes ZERO WIDTH NO-BREAK SPACE, which can be used
> as a “signature” when at the beginning of a string.  This use is also
> inappropriate for domain names.
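The category tables in clause 5 can be read as a per-character filter. The sketch below is one possible reading, not a definitive rule set: it assumes that the questioned categories (Me, No, Sk, Cf) are rejected, that HYPHEN-MINUS is the only punctuation exception, and it uses the Unicode data bundled with Python rather than Unicode 3.0. All names are hypothetical.

```python
import unicodedata

# Categories marked "Ok" (or conditionally ok) in clause 5.
ALLOWED_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Mn", "Mc", "Nd", "Nl"}

# Assumption for this sketch: the only punctuation exception is HYPHEN-MINUS.
ALLOWED_EXCEPTIONS = {"-"}


def disallowed_characters(label: str) -> list:
    """List the characters of one label that the category tables would reject."""
    return [
        ch for ch in label
        if unicodedata.category(ch) not in ALLOWED_CATEGORIES
        and ch not in ALLOWED_EXCEPTIONS
    ]


def label_is_acceptable(label: str) -> bool:
    """True if the label is non-empty and contains no rejected character."""
    return bool(label) and not disallowed_characters(label)
```

FULL STOP is not in the exception set because it is the separator: a caller would split the name on "." first and test each part.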
> 
> 5.8     The Plane 14 suggestion
> 
> The “language tag” characters, that are suggested to be allocated in plane
> 14, see Unicode technical report number 7, are inappropriate for use in
> domain names.
> 
> 5.9     ISO/IEC TR 10176 AMD 1
> 
> The technical report ISO/IEC TR 10176 (Guidelines for the preparation of
> programming language standards) in its revised (soon to be AMD 1) annex
> lists characters that at a minimum should be accepted in programming
> language identifiers.  It does so for a “level 2 implementation” of ISO/IEC
> 10646.  A domain name is similar to an “identifier” in a programming
> language, so what 10176 lists in its (revised!) Annex A should at least be
> considered.
> 
> See PDAM text at http://std.dkuug.dk/jtc1/sc22/wg20/docs/n699.pdf.
> Note that this TR (as amended in what will be AMD 1) is based on Unicode
> 2.1, not Unicode 3.0.  An AMD 2, etc., is promised to only extend what is in
> AMD 1.  Note also that compatibility forms are excluded from the lists in
> AMD 1, but programming languages may of course allow both compatibility
> forms and “level 2” combining marks.  Nothing is said in AMD 1 about
> normalisation.
> 
> ISO/IEC TR 10176 PDAM 1 is supported by the Unicode consortium, and is their
> (and SC22/WG20's) correction to the original list.  The original list should
> be considered defective.
> 
> 6       Normalisation for domain names
> 
> 6.1     Case normalisation
> 
> Internet domain names have been case insensitive from the start.  When
> extending the allowed characters in domain names, it would be unwise to
> either abandon case insensitiveness or restrict it to just the ASCII part.
> Instead, this principle should be extended to the new characters allowed in
> domain names.  However, there are some problems with this.  First, the case
> mappings documented by the Unicode consortium are only informative, not
> normative.  Second, there are some known exceptions: like that for Turkish i
> and dotless i.  Third, for several more cases the case mapping is not 1 to
> 1, e.g. sharp s (ß; U+00DF) maps to uppercase SS, mapping that back to
> lowercase gives ss.  There are several other such cases. [not sure exactly
> what to do with these]
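The second and third problems are easy to reproduce. The sketch below uses Python's built-in, locale-independent case mappings, which follow current Unicode data; round_trip is a name made up for the example.

```python
def round_trip(s: str) -> str:
    """Uppercase, then lowercase again: tolower(toupper(x))."""
    return s.upper().lower()


SHARP_S = "\u00df"              # LATIN SMALL LETTER SHARP S
DOTLESS_I = "\u0131"            # LATIN SMALL LETTER DOTLESS I
CAPITAL_I_WITH_DOT = "\u0130"   # LATIN CAPITAL LETTER I WITH DOT ABOVE

if __name__ == "__main__":
    print(SHARP_S.upper(), round_trip(SHARP_S))        # one letter becomes two
    print(DOTLESS_I.upper(), round_trip(DOTLESS_I))    # dotless i comes back dotted
    print([hex(ord(c)) for c in CAPITAL_I_WITH_DOT.lower()])
```

Sharp s does not survive the round trip (it comes back as "ss"), and dotless i comes back as an ordinary dotted i, which is not what a Turkish user would expect.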
> 
> Unicode Technical Report number 21 [UTR21] describes one way of doing this
> [is that appropriate? Any better way of doing this?] SHARP S, YPOGEGRAMMENI,
> PROSGEGRAMMENI?  Map to lowercase? Map to uppercase? tolower(toupper(x))?
> UTR 21 (with the associated data file CaseFolding.txt) essentially
> (exactly?) implies tolower(toupper(x)) (see also below); dotless i might not
> be handled the way desired (in Turkey), nor might sigma and other letters
> with final forms.
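One way to probe the "essentially (exactly?)" question is to compare tolower(toupper(x)) with a case folding character by character. The sketch uses Python's str.casefold(), which follows the full case folding of the Unicode version bundled with the interpreter, so it says nothing about the CaseFolding.txt of Unicode 3.0. The helper names are hypothetical.

```python
def round_trip(ch: str) -> str:
    """tolower(toupper(x)) on a single character."""
    return ch.upper().lower()


def disagreements(chars: str) -> list:
    """Characters for which the round trip and the case folding differ."""
    return [ch for ch in chars if round_trip(ch) != ch.casefold()]


# SHARP S, FINAL SIGMA, PROSGEGRAMMENI, DOTLESS I
CANDIDATES = "\u00df\u03c2\u1fbe\u0131"

if __name__ == "__main__":
    for ch in CANDIDATES:
        print(f"U+{ord(ch):04X}: round trip {round_trip(ch)!r}, fold {ch.casefold()!r}")
```

With current data the two agree on sharp s, final sigma and PROSGEGRAMMENI, but not on dotless i: the round trip turns it into "i" while the default folding leaves it alone. The comparison is per character on purpose, since Python's lower() treats a capital sigma at the end of a word contextually.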
> 
> 6.2     Unicode normalisation
> 
> Canonical distinctions, in the Unicode sense, shall be ignored.
> Since case distinctions should be ignored, compatibility distinctions should
> most certainly be ignored too.  Compatibility distinctions can be normalised
> away with the same algorithm as canonical distinctions are normalised away.
> Normalisation form KC (compatibility decomposition, logically followed by
> canonical composition), see Unicode Technical Report number 15 [UTR15],
> should be used for domain names, at least at registration time, if not at
> lookup time.  Among a few other things, this maps WIDE, NARROW, and
> PRESENTATION FORM characters to their nominal corresponding character.
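A short Python sketch of what normalisation form KC does to the kinds of characters mentioned, using unicodedata.normalize from the standard library (current Unicode data). The wrapper name nfkc is only a convenience for the example.

```python
import unicodedata


def nfkc(s: str) -> str:
    """Normalisation form KC: compatibility decomposition, then canonical composition."""
    return unicodedata.normalize("NFKC", s)


FULLWIDTH_ABC = "\uff21\uff22\uff23"   # FULLWIDTH LATIN CAPITAL LETTER A, B, C
HALFWIDTH_KA = "\uff76"                # HALFWIDTH KATAKANA LETTER KA
LIGATURE_FI = "\ufb01"                 # LATIN SMALL LIGATURE FI
ALEF_ISOLATED = "\ufe8d"               # ARABIC LETTER ALEF ISOLATED FORM
E_PLUS_ACUTE = "e\u0301"               # e followed by COMBINING ACUTE ACCENT

if __name__ == "__main__":
    for sample in (FULLWIDTH_ABC, HALFWIDTH_KA, LIGATURE_FI, ALEF_ISOLATED, E_PLUS_ACUTE):
        print([hex(ord(c)) for c in sample], "->", [hex(ord(c)) for c in nfkc(sample)])
```

The last sample shows the canonical part: the two-character sequence composes to the single character U+00E9, so both spellings name the same domain.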
> 
> It is to the character string resulting from KC normalisation that the
> category tests above apply.
> 
> Normalisation KC by itself does not imply any case normalisation.
> Note that normalise(KC, casefold(x)) is not the same as
> casefold(normalise(KC, x)), if casefold follows CaseFolding.txt.
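The order dependence can be demonstrated with two characters that have a compatibility decomposition into uppercase letters but no case folding of their own. This is a sketch using Python's str.casefold() and unicodedata (current Unicode data); the function names are hypothetical.

```python
import unicodedata


def fold_then_normalise(s: str) -> str:
    """normalise(KC, casefold(x))"""
    return unicodedata.normalize("NFKC", s.casefold())


def normalise_then_fold(s: str) -> str:
    """casefold(normalise(KC, x))"""
    return unicodedata.normalize("NFKC", s).casefold()


TELEPHONE_SIGN = "\u2121"       # decomposes to the letters T, E, L
UPSILON_WITH_HOOK = "\u03d2"    # decomposes to GREEK CAPITAL LETTER UPSILON

if __name__ == "__main__":
    for ch in (TELEPHONE_SIGN, UPSILON_WITH_HOOK):
        print(f"U+{ord(ch):04X}: {fold_then_normalise(ch)!r} vs {normalise_then_fold(ch)!r}")
```

Folding first leaves the uppercase letters that the decomposition introduces, so a registry that wants a single canonical form would have to normalise before folding, or fold again after normalising.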
> 
> 6.3     Further normalisation
> 
> FINAL SIGMA, FINAL KAF, FINAL MEM, FINAL NUN, FINAL PE, FINAL TSADI, FINAL
> SEMKATH, BOPOMOFO FINAL *? Suggestion: ignore ‘finality’, i.e., consider
> them to be equivalent with their corresponding ‘ordinary’ version.
> 
> [Funny, CaseFolding.txt maps all sigmas to final(!) sigmas; but does nothing
> for other ‘final’ characters.]
> 
> Map HYPHEN, NO-BREAK HYPHEN, and * DASHes to HYPHEN-MINUS? Remove * SOFT
> HYPHEN and ZWSP?
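A sketch of what the two suggestions in 6.3 could look like if adopted. Several choices here are assumptions: "* DASHes" is read as every character of general category Pd, the Bopomofo finals are left out, and the function and table names are made up. Characters are looked up by their Unicode names so that the table documents itself.

```python
import unicodedata

FINAL_FORM_NAMES = [
    "GREEK SMALL LETTER FINAL SIGMA",
    "HEBREW LETTER FINAL KAF",
    "HEBREW LETTER FINAL MEM",
    "HEBREW LETTER FINAL NUN",
    "HEBREW LETTER FINAL PE",
    "HEBREW LETTER FINAL TSADI",
    "SYRIAC LETTER FINAL SEMKATH",
]

# Map each final form to the character whose name lacks the word FINAL.
FINAL_TO_ORDINARY = {
    unicodedata.lookup(name): unicodedata.lookup(name.replace("FINAL ", ""))
    for name in FINAL_FORM_NAMES
}

REMOVED = {"\u00ad", "\u200b"}   # SOFT HYPHEN, ZERO WIDTH SPACE


def fold_further(s: str) -> str:
    """Drop SOFT HYPHEN and ZWSP, map dashes to HYPHEN-MINUS, ignore finality."""
    out = []
    for ch in s:
        if ch in REMOVED:
            continue
        if unicodedata.category(ch) == "Pd":
            out.append("-")
        else:
            out.append(FINAL_TO_ORDINARY.get(ch, ch))
    return "".join(out)
```

Reading "DASHes" as category Pd is broader than the characters named in the text (it also catches script-specific hyphens), so a real rule would probably want an explicit list.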
> 
> “New line function” ‘normalisation’ (see UTR 13) does not apply to domain
> names, since no domain name is to have any such character in it.
> 
> 6.4     A possible alternative to normalisation: collation weighting
> 
> A possible alternative to doing KC and case normalisation is to use the ISO/IEC
> 14651 CTT (common template table), or the UTR 10 associated tables, with
> some tailoring suitable for the DNS (no, NOT local ones).  In particular,
> punctuation and symbols must be significant at level 1.  Then determine
> equality up to and including level 2 (accents; similar), but not level 3
> (case; hira/kata, various compatibility distinctions).
> 
> This is also based on Unicode 2.1, not yet Unicode 3.0.  Also, there is at
> present NO promise not to make changes that may affect, to some degree, use of
> the weightings that result.  In particular, for 14651 no particular weight
> VALUES are assigned.  That is up to each implementation.  For the UTR 10
> tables, the actual weight values may change at any update (or in any
> suitable way by tailoring, or other implementation decisions), so different
> versions cannot be used in a mix.  Finally, there is no resulting “normal
> form” character string from these weight tables.
> 
> 7       One should not mix scripts between FULL STOPs
> 
> It is not a good idea to mix scripts freely in a single “part” of a domain
> name.  E.g., it would be very confusing if an initial A is a Greek A, while
> the rest of the name part is in the Latin script.
> 
> However, what constitutes a script is not clearly defined, and some
> orthographies (like the Japanese) normally do mix “scripts” in a single
> “word”.  Therefore this must be left for human judgement.  For an automated
> service one may apply some heuristic on suggested names that may need human
> scrutiny, or reject doubtful cases for registration.  Note also that ASCII
> digits can be used with any other script, and many of the combining
> non-spacing marks are script generic, i.e. can be used with several
> different scripts.
> 
> No rigid scheme should be applied for this.  It should only be a
> registration time heuristic, overrideable by human intervention.
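A registration-time heuristic of the kind described might look like the sketch below. Python's standard library has no script property, so this crude version takes the first word of each character's Unicode name as a stand-in for its script; that shortcut, and the function names, are assumptions of the example. Combining marks, decimal digits and HYPHEN-MINUS are treated as script generic.

```python
import unicodedata


def scripts_in(label: str) -> set:
    """Approximate the set of scripts used in one label."""
    scripts = set()
    for ch in label:
        category = unicodedata.category(ch)
        if category.startswith("M") or category == "Nd" or ch == "-":
            continue  # script generic: usable with any script
        # First word of the character name, e.g. LATIN, GREEK, CJK, HIRAGANA.
        scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def needs_human_review(domain: str) -> bool:
    """True if any FULL STOP separated part appears to mix scripts."""
    return any(len(scripts_in(label)) > 1 for label in domain.split("."))
```

As expected from the text, this flags a Latin name that starts with a Greek capital alpha, but it also flags ordinary Japanese names that mix ideographs and kana, which is why the result can only be a hint for a human and not a rejection rule.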
> 
> 8       &-encoding (XML), %-encoding (URL), and =-encoding (QP)
> 
> Any &-encoding used in XML (or HTML) documents in a string that contains a
> domain name shall be decoded before sending the domain name to a DNS system.
> Note that XML &-codes are character oriented and independent of the
> character encoding used for the XML document itself.
> 
> Any %-encoding in a URL shall not be decoded in the domain name part, and %
> as such is not legal in a domain name.  Such a domain name is thus
> malformed.  The % character may mean something else though, so no attempt at
> URL %-decoding shall be done at that point.  In addition, the octet oriented
> (not character oriented) %-encoding is for an unknown character encoding,
> and any attempt at decoding it by the client is likely to be in error.
> 
> Any =-encoding in an e-mail in Quoted-Printable shall be decoded according
> to the charset declaration of the message.  Hopefully, Quoted-Printable will
> go out of use, so this should be less of a problem...
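The three rules of clause 8 can be sketched with Python's standard html and quopri modules. The function names and the example name "bücher.example" are made up for illustration; the charset argument stands for whatever the message declares.

```python
import html
import quopri


def from_markup(text: str) -> str:
    """Decode XML/HTML character references; the result is character data."""
    return html.unescape(text)


def from_quoted_printable(raw: bytes, charset: str) -> str:
    """Decode =XX octets, then interpret them in the declared charset."""
    return quopri.decodestring(raw).decode(charset)


def reject_percent(domain: str) -> str:
    """% is not legal in a domain name and is never %-decoded here."""
    if "%" in domain:
        raise ValueError("malformed domain name: contains %")
    return domain


def is_percent_free(domain: str) -> bool:
    """Convenience wrapper: True if reject_percent() would accept the name."""
    try:
        reject_percent(domain)
        return True
    except ValueError:
        return False
```

Note how the Quoted-Printable case depends on the charset: the same name arrives as "=FC" under ISO 8859-1 and as "=C3=BC" under UTF-8, which is exactly the information a bare %-encoding lacks.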
> 
> 9       E-mail address internationalisation
> 
> The pre-@ part of e-mail addresses should be internationalised in the same
> way as domain names are internationalised.
> 
> ====================================================================