Re: host names and nameprep (was: Re: [idn] IRIs ought to use internationalized *host* names)



Roozbeh Pournader <roozbeh@sharif.edu> wrote:

> I had a lot of debates with Persian experts here in Iran about ZWNJ in
> domain names.

Oh, I thought we were talking about zero-width space.  Zero-width
non-joiner is not a space character (class Zs), but a formatting
character (class Cf).  Although ASCII host names do not allow characters
of broad-class C, the only such characters that exist in ASCII are
control characters (class Cc).  So it's not obvious whether any/some/all
characters of class Cf should be allowed in host labels.
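
For anyone who wants to check the category claims above, here is a minimal sketch in Python. It assumes the frozen Unicode 3.2.0 database that ships with Python's unicodedata module is an acceptable stand-in for the Unicode version under discussion; categories can differ in other versions. The helper names are made up for illustration. The expectation from the paragraph above is that ZWNJ reports Cf and that Cc is the only subclass of C found in ASCII.

```python
import unicodedata

# Assumption: Python's frozen Unicode 3.2.0 database is used here;
# general categories may differ in other Unicode versions.
ucd = unicodedata.ucd_3_2_0


def category(code_point: int) -> str:
    """Return the two-letter general category, e.g. 'Cf' or 'Cc'."""
    return ucd.category(chr(code_point))


def ascii_class_c_categories() -> set:
    """Collect the subclasses of broad class C that occur in ASCII."""
    found = set()
    for cp in range(0x80):
        cat = category(cp)
        if cat.startswith("C"):
            found.add(cat)
    return found


if __name__ == "__main__":
    print("U+200C:", category(0x200C))  # ZERO WIDTH NON-JOINER
    print("class C subclasses in ASCII:", ascii_class_c_categories())
```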

Let's see what nameprep does with characters of class Cf...

prohibited:

     070F        SYRIAC ABBREVIATION MARK
     180E        MONGOLIAN VOWEL SEPARATOR
     200E..200F  [BIDI FORMATTING]
     202A..202E  [BIDI FORMATTING]
     206A..206F  [SWAPPING/SHAPING]
     FFF9..FFFB  [INTERLINEAR ANNOTATIONS]
    1D173..1D17A [MUSICAL SYMBOLS]
    E0001        LANGUAGE TAG
    E0020..E007F [TAGGING CHARACTERS]

mapped out:

     200C        ZERO WIDTH NON-JOINER
     200D        ZERO WIDTH JOINER
     FEFF        ZERO WIDTH NO-BREAK SPACE

unassigned:

     2060        WORD JOINER
     2061        FUNCTION APPLICATION
     2062        INVISIBLE TIMES
     2063        INVISIBLE SEPARATOR

kept:

     06DD        ARABIC END OF AYAH

What makes 06DD special?  It was class Me in Unicode 3.1, and became
class Cf in Unicode 3.2, but nameprep is based on Unicode 3.1.

Anything prohibited by nameprep should also be prohibited in host
labels.  The question is which, if any, of the Cf characters allowed
by nameprep (U+06DD, U+200C, U+200D, U+FEFF) should be allowed in host
labels.
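
The lists above can be restated as a small lookup table, sketched here in Python. The table is transcribed by hand from this message, not taken from any nameprep implementation, and the names NAMEPREP_CF, status, and allowed_by_nameprep are hypothetical. "Allowed" is taken to mean "mapped out" or "kept", which is an assumption about how to read the question; the unassigned code points are left out.

```python
# Hand-transcribed from the lists in this message (an assumption, not an
# authoritative nameprep table): (status, first code point, last code point).
NAMEPREP_CF = [
    ("prohibited", 0x070F, 0x070F),
    ("prohibited", 0x180E, 0x180E),
    ("prohibited", 0x200E, 0x200F),
    ("prohibited", 0x202A, 0x202E),
    ("prohibited", 0x206A, 0x206F),
    ("prohibited", 0xFFF9, 0xFFFB),
    ("prohibited", 0x1D173, 0x1D17A),
    ("prohibited", 0xE0001, 0xE0001),
    ("prohibited", 0xE0020, 0xE007F),
    ("mapped out", 0x200C, 0x200D),
    ("mapped out", 0xFEFF, 0xFEFF),
    ("unassigned", 0x2060, 0x2063),
    ("kept", 0x06DD, 0x06DD),
]


def status(code_point):
    """Return the listed status for a code point, or None if not listed."""
    for name, first, last in NAMEPREP_CF:
        if first <= code_point <= last:
            return name
    return None


def allowed_by_nameprep():
    """List the Cf code points that are mapped out or kept, in order."""
    allowed = []
    for name, first, last in NAMEPREP_CF:
        if name in ("mapped out", "kept"):
            allowed.extend(range(first, last + 1))
    return sorted(allowed)


if __name__ == "__main__":
    for cp in allowed_by_nameprep():
        print("U+%04X %s" % (cp, status(cp)))
```

Under those assumptions the filter yields the same four code points named in the question: U+06DD, U+200C, U+200D, U+FEFF.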

AMC