
[idn] UTC feedback



The Unicode Technical Committee discussed the issue of DNS names at its meeting
last week, and had some recommendations. I will try to summarize the points
brought out during the discussion. If you have any questions, please let me know.

1. The committee is in favor of the canonicalization model of

Filter -> Fold -> NFKC -> Serialize

- The filter step would reject certain characters, thus causing the name to be
illegal.

- The fold step would fold characters together (e.g. case mapping), or fold
characters away (e.g. delete a character by folding it to a null string).

- The NFKC step puts the string into normalized form, as defined by UTR#15.

- The Serialize step produces a reversible mapping to a sequence of bytes.
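
As a rough illustration of how the four steps compose, here is a minimal Python
sketch. Everything in it is an assumption made for the example: the reject set
is hypothetical, Python's built-in str.casefold() stands in for the Fold table,
and UTF-8 stands in for the Serialize step, which is not specified here.

```python
import unicodedata

# Hypothetical reject set, for illustration only (not a recommended Filter).
REJECTED = {"\u2044", "\u0020"}


def filter_step(name: str) -> str:
    """Reject the whole name if it contains any filtered character."""
    for ch in name:
        if ch in REJECTED:
            raise ValueError(f"illegal character U+{ord(ch):04X}")
    return name


def fold_step(name: str) -> str:
    """Fold characters together; str.casefold() is a stand-in for Fold."""
    return name.casefold()


def nfkc_step(name: str) -> str:
    """Put the string into Normalization Form KC."""
    return unicodedata.normalize("NFKC", name)


def serialize_step(name: str) -> bytes:
    """Reversible mapping to bytes; UTF-8 is only a placeholder here."""
    return name.encode("utf-8")


def canonicalize(name: str) -> bytes:
    """Filter -> Fold -> NFKC -> Serialize."""
    return serialize_step(nfkc_step(fold_step(filter_step(name))))
```

With these assumptions, canonicalize("Example") returns b"example", and a name
containing a space raises ValueError at the Filter step.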


2. Filter Closure

The Filter must be closed under folding and normalization. That is, suppose that
NFKC( Fold( Filter( x ) ) ) contains y. Then, if ( isFiltered( y ) == REJECT ), it
must also be the case that ( isFiltered( x ) == REJECT ).

In the process of developing the filter, you start with an original filter, and
programmatically add all characters that would be canonicalized into characters
that would be rejected by the original filter. This is, of course, not done at
runtime; it is just a formal constraint on the Filter.

Thus if the original filter rejects U+2044 FRACTION SLASH, then the closure of
that filter must reject U+00BC VULGAR FRACTION ONE QUARTER (since the latter is
canonicalized to a string that contains U+2044).
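
The closure can be computed offline along the following lines. This is only a
sketch under stated assumptions: str.casefold() stands in for Fold, the
function names are made up for the example, and the result depends on the
Unicode version shipped with the Python interpreter.

```python
from __future__ import annotations

import unicodedata


def canonical_form(text: str) -> str:
    """Fold, then NFKC (str.casefold() is an assumed stand-in for Fold)."""
    return unicodedata.normalize("NFKC", text.casefold())


def close_filter(original_rejects: set[str]) -> set[str]:
    """Add every character whose canonical form contains a rejected one."""
    closed = set(original_rejects)
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # skip surrogate code points
        ch = chr(cp)
        if any(c in original_rejects for c in canonical_form(ch)):
            closed.add(ch)
    return closed


# Original filter rejects only U+2044 FRACTION SLASH.
CLOSED = close_filter({"\u2044"})
```

Here CLOSED contains U+00BC VULGAR FRACTION ONE QUARTER, because its NFKC form
is DIGIT ONE, FRACTION SLASH, DIGIT FOUR.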


3. For Fold, the candidates include:

Case: use case folding

data: http://www.unicode.org/Public/3.0-Update1/CaseFolding-2d1.beta.txt
visual: http://www.unicode.org/unicode/reports/tr21/charts/ (for a visualization)

Dashes: map characters with General Category Pd to U+002D

data: http://www.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.txt
visual: http://www.unicode.org/unicode/reports/tr24/charts/ScriptChart14.html

Spaces: map characters with General Category Zs to U+0020*

data: http://www.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.txt
visual: http://www.unicode.org/unicode/reports/tr24/charts/ScriptChart7.html

* Only necessary if these are not filtered.
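
The dash and space folds can be sketched directly from the General Category
property. This assumes Python's unicodedata module, which reflects a later
Unicode version than the 3.0 data files listed above; the function name is
hypothetical.

```python
import unicodedata


def fold_dashes_and_spaces(text: str) -> str:
    """Map General Category Pd to U+002D and Zs to U+0020."""
    out = []
    for ch in text:
        category = unicodedata.category(ch)
        if category == "Pd":
            out.append("\u002D")  # HYPHEN-MINUS
        elif category == "Zs":
            out.append("\u0020")  # SPACE
        else:
            out.append(ch)
    return "".join(out)
```

For example, U+2013 EN DASH folds to U+002D and U+00A0 NO-BREAK SPACE folds to
U+0020, while U+2212 MINUS SIGN (category Sm) is left alone.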

(The UTC did not discuss whether it would be advisable to fold away Hebrew accents
or points, that is, the characters in:

U+0591 HEBREW ACCENT ETNAHTA
..
U+05C4 HEBREW MARK UPPER DOT

see #6).


4. Serialize. The UTC did not discuss the Serialize phase in any detail. In
general, the consortium does not favor new transformations (beyond UTF-8, UTF-16,
and UTF-32). However, it recognizes that there may be additional constraints in
particular environments such as for DNS names that warrant using a novel
transformation, such as a base-36 approach.


5. Language-independence. Based on the extensive internationalization experience
of its membership, the technical committee believes strongly that having either
language-dependent canonicalization or allowing multiple character encodings would
be disastrous. The committee recognizes that the canonicalization may not be
optimal for all languages, but (a) the benefits of uniformity far outweigh the
drawbacks in a few cases, (b) there are work-arounds in many cases, and (c)
users are already used to restrictions on DNS names that in most cases represent far
more problems for legibility (e.g. the lack of space).

In the case of (b), for example, French users may be accustomed to uppercase
letters either with or without accents. Yet in other languages, the distinction between
accented letters must be maintained -- folding them would be like folding all
vowels to E in all English words. An acceptable work-around is to register both
the name with all accents and the name with none.
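
A registrant-side helper for this work-around might look like the sketch below.
It is an assumption-laden illustration, not part of the canonicalization:
"no accents" is approximated by dropping nonspacing marks (category Mn) after
NFD decomposition, which misses letters that have no decomposition, and the
function names are hypothetical.

```python
from __future__ import annotations

import unicodedata


def strip_accents(name: str) -> str:
    """Drop nonspacing marks after canonical decomposition (approximation)."""
    decomposed = unicodedata.normalize("NFD", name)
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", kept)


def names_to_register(name: str) -> list[str]:
    """Return the fully accented name and, if different, the unaccented one."""
    accented = unicodedata.normalize("NFC", name)
    plain = strip_accents(name)
    return [accented] if accented == plain else [accented, plain]
```

Under these assumptions, names_to_register("café") yields both "café" and
"cafe", while a name without accents yields a single entry.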


6. A subcommittee was formed to look at this issue in more detail; in particular,
to make recommendations for Filter (and perhaps Fold). The committee agreed that the
inclusion of characters into these steps must be based on principles, so that as
new characters are added to the standard, the principles can be applied to those
new characters.