
[idn] compatibility chars in draft-ietf-idn-nameprep



I believe that all characters specified by the Unicode
standard as compatibility characters should be prohibited.
(But not all non-compatibility characters should be allowed.)

I argue below [search for "PROHIBITED"] that folding them either
in nameprep or in the DNS server produces bugs. They must be
prohibited outright.

Here are some reasons to prohibit or fold some types of
compatibility characters:

    1) Those that exist in order to preserve a code point
    in a round-trip mapping from a legacy character set
    through Unicode back to the legacy character set. As
    users exchanging domain names have no knowledge of legacy
    character sets, no such characters should be permitted.
    The vast majority of compatibility characters (including
    the fullwidth and halfwidth variants) are in this class.

    2) Font variants, such as the script-f or blackboard
    bold N. These are used for two distinct purposes (i)
    as font variants in text, e.g. in the photographic
    notation f/64, and (ii) as symbols, e.g. as the currency
    symbol for the Netherlands. (Mathematical uses can be
    considered a font variant.) Providing such font shifts
    is inconsistent with the plain text nature of DNS; the
    characters provided aren't enough to implement mathematical
    typesetting (at least in Unicode 2.x) and are there
    mostly to preserve round-trip mapping. Some current
typesetting systems (e.g., those used for classified ads
containing email addresses) still can't handle these characters, and such systems
    may persist for years in countries which use the Roman script;
    in practice, domains would be incorrectly written in a plain
    font as most users would assume the plain and fancy
    characters to be equivalent.
    
    3) The compatibility characters in the ASCII range -
    tilde, underscore and back-quote - are all ones which
    any non-programmer would consider suspect.
  
    4) Ones which are simply duplicates for no good reason
    at all, such as the extra Greek mu at 0xb5. (My guess is
    that someone from a non-metricated country argued that
    a micro sign is not a mu. Do we really want to repeat
    all such arguments here?)

    5) Ligatures, such as the "ff" ligatures used in
    typesetting English using "Times Roman"-like fonts.
    These are not sufficient to typeset languages other
    than English or even English using some other font
    styles. As they cannot be represented faithfully
    in all media (in particular, on all GUI screens) and
    would cause confusion in handwritten notes, they
    should be prohibited.
    
    6) The small forms block from 0xfe50 to 0xfe6f.
    These are smaller versions of punctuation characters
    and would cause confusion even when typeset.
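All of the classes above are marked in the Unicode Character
Database: a compatibility character's decomposition mapping carries
a formatting tag such as <wide>, <font> or <compat>, while canonical
decompositions carry no tag. As a rough sketch of a blanket check
(using Python's unicodedata module purely for illustration - not
anything proposed for the draft itself):

```python
import unicodedata

def is_compatibility_char(ch):
    # A compatibility character's decomposition mapping begins with
    # a formatting tag, e.g. "<wide> 0041" or "<compat> 03BC";
    # canonical decompositions (and characters with no decomposition)
    # have no such tag.
    return unicodedata.decomposition(ch).startswith('<')

print(is_compatibility_char('\uFF21'))  # fullwidth A  -> True
print(is_compatibility_char('\u00B5'))  # micro sign   -> True
print(is_compatibility_char('\u00E4'))  # a-umlaut     -> False (canonical)
```

A check like this covers the whole repertoire mechanically, which
is exactly what hand-compiled prohibition lists fail to do.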
  
It is very easy to miss code points when compiling lists
of characters to be prohibited. Indeed, the draft misses the
small punctuation from 0xfe50 to 0xfe6f and the compatibility mu
at 0xb5. Unicode has been over this ground and done it well
IMO - about all I can find that they missed are the z-bars
at 0x1b5-6, and that's probably been fixed in 3.0. It took
them years and they knew what they were doing.

As a special case, the precomposed Korean Hangul syllables
might be allowed; otherwise the resulting UTF-8 strings would
be very long. If we are stuck with the 63-octet limit, then
Korean may require special processing. But this can be regarded
simply as a form of compression, and it can be done
algorithmically, as the Hangul syllables are fixed arrangements
of Jamo letters.
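The algorithmic mapping is spelled out in the Unicode standard's
Hangul syllable composition formula. A minimal sketch (constants
are those defined by the standard; the function name is my own):

```python
# Per the Unicode standard, a precomposed syllable in
# U+AC00..U+D7A3 is:  S = SBASE + (Lindex * VCOUNT + Vindex) * TCOUNT + Tindex
SBASE, LBASE, VBASE, TBASE = 0xAC00, 0x1100, 0x1161, 0x11A7
VCOUNT, TCOUNT = 21, 28

def compose_syllable(lead, vowel, trail=None):
    """Compose a leading consonant, vowel, and optional trailing
    consonant Jamo into one precomposed Hangul syllable."""
    li = ord(lead) - LBASE
    vi = ord(vowel) - VBASE
    ti = (ord(trail) - TBASE) if trail else 0
    return chr(SBASE + (li * VCOUNT + vi) * TCOUNT + ti)

# U+1112 HIEUH + U+1161 A + U+11AB NIEUN -> U+D55C ("han")
print(hex(ord(compose_syllable('\u1112', '\u1161', '\u11AB'))))
```

So the "compression" costs one code point (three UTF-8 octets)
per syllable instead of two or three Jamo code points.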

Note that a user application that runs in a legacy character
set and accepts domain names as input can apply compatibility
decomposition to "fold" the characters - a native Unicode
one is unlikely to generate them from keyboard input anyway.
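In Unicode terms, that client-side folding is compatibility
normalization (NFKC). A sketch of what such an application
would do, again using Python's unicodedata for illustration:

```python
import unicodedata

# Hostname input as converted from a legacy charset -
# here, fullwidth "ABC" such as a Japanese codec might produce:
raw = '\uFF21\uFF22\uFF23'

# NFKC applies compatibility decomposition followed by canonical
# composition, folding the fullwidth forms to plain ASCII before
# the name ever reaches the DNS.
folded = unicodedata.normalize('NFKC', raw)
print(folded)  # -> ABC
```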

Now, WHY THEY SHOULD BE PROHIBITED, not just folded:

The DNS server *must not fold* compatibility characters, otherwise
software which is working in or converting from a legacy character
set would get inconsistent results. For example, a Japanese
application that does something with host names and writes
them to a file ought to see that a name containing a fullwidth
"A" *does not exist*, so that it doesn't write it to some
file which is then later read by other software. If that next piece
of software were a Unicode-native database, it would treat
the fullwidth "A" as a different character from the normal "A",
and a database lookup under the host's real name (containing
the normal "A") would not find it. (Normal Unicode string
comparison says the compatibility character is not equal to
its decomposition, but that canonical equivalents (e.g. a-umlaut
and <a,umlaut>) are equal to each other.) In a very real sense,
compatibility characters are second class characters - they
don't really exist, and should not be folded. The DNS server
must simply reject them (trivially, because such names are
never created in the first place).
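These comparison semantics can be checked directly. A sketch
(normalization forms as defined by Unicode; Python's unicodedata
used purely for illustration):

```python
import unicodedata

# Canonical equivalents compare equal after canonical (NFC)
# normalization: a-umlaut vs. <a, combining umlaut>.
precomposed = '\u00E4'
combining = 'a\u0308'
print(unicodedata.normalize('NFC', precomposed) ==
      unicodedata.normalize('NFC', combining))   # -> True

# But a compatibility character is NOT canonically equal to its
# decomposition; only the compatibility forms (NFKC/NFKD) fold it.
fullwidth_a = '\uFF21'
print(unicodedata.normalize('NFC', fullwidth_a) == 'A')   # -> False
print(unicodedata.normalize('NFKC', fullwidth_a) == 'A')  # -> True
```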

If the document is amended to say "all Unicode compatibility
characters are prohibited", the list of prohibited characters
is greatly shortened from what is currently in the document,
and may be reducible to a small stable set.