
[idn] Versions of Nameprep



(This should probably have been discussed in the nameprep design team
first, and that was my intention, after the IETF, but since the topic
has now come up here, I'll post it to the whole IDN mailing list instead...)

Summary:

     By introducing two different policies for the use of code points
     that are unassigned in Unicode, i.e. code points that are unassigned
     according to the newest version of the NamePrep specification, we do
     not need versioning in the IDN protocol.

    (a) Registries never register domains containing unassigned
        code points.
    (b) Clients let unassigned code points pass through without
        any modification.

Now, here is the full rationale:

Versions of Unicode.

- Since Unicode will be updated fairly frequently, we also want to allow new
characters to be used as soon as they are defined.
- We do this by updating NamePrep.
- We want to allow the maximal compatibility between systems running
different versions of NamePrep.

Mechanism.

In a particular version of NamePrep, the following lists of code points can
be generated based on the version of Unicode and the mapping/prohibition
tables.

AIO. code points allowed in the input and in the output
AI. code points allowed in the input, but not in the output
D. assigned code points that are disallowed completely (input and output --
includes noncharacters, unpaired surrogates, etc.)
U. unassigned code points

Note: the reason that AI exists is that some characters will disappear or be
transformed in mapping or normalization, so they can appear in the input,
but will never be in the output.

In any subsequent version of NamePrep, because of updates to Unicode, code
points from U will move to D, AI or AIO.
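
The four classes, and the way code points leave U between versions, can be
sketched as follows. This is only an illustration: the two tables are
hypothetical, single letters stand in for real code points, and the names
Cls, V1, V2, classify and moved_out_of_u are not taken from any
specification.

```python
from enum import Enum


class Cls(Enum):
    AIO = "allowed in input and output"
    AI = "allowed in input, but not in output"
    D = "assigned, disallowed completely"
    U = "unassigned"


# Hypothetical tables for two successive NamePrep versions.
# Single letters stand in for real code points.
V1 = {"B": Cls.AIO, "C": Cls.AIO, "q": Cls.AI, "!": Cls.D}
V2 = {**V1, "X": Cls.AIO, "Q": Cls.AI, "#": Cls.D}


def classify(table, cp):
    """Anything a version's table does not list is unassigned (U) there."""
    return table.get(cp, Cls.U)


def moved_out_of_u(old, new):
    """Code points that were U in the old version, with their new class."""
    return {cp: cls for cp, cls in new.items() if classify(old, cp) is Cls.U}


if __name__ == "__main__":
    for cp, cls in sorted(moved_out_of_u(V1, V2).items()):
        print(cp, "moved from U to", cls.name)
```

In this toy data every code point that is assigned in V1 keeps its class in
V2; only members of U change class, which is the property the rest of the
argument relies on.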

Policies.

Registrars are forbidden to register any IDNs containing code points outside
of AIO for the latest version of Unicode / NamePrep. That is, they are
forbidden to register any IDNs containing AI, D or U code points. (In
addition, the allowable names must be in canonical order!)

Clients should treat U code points as if they were AIO when processing
IDNs as a part of NamePrep. Some applications might, however, be
implemented to keep treating them as U, or to treat them as AIO only after
first warning the user that the character is of class U -- all depending
on how the domain name is used in the specific application.

Intermediaries may reject names that are not in canonical order, or that
contain code points that are in their versions of AI or D, but must not
reject names for containing U.
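
The three policies can be written down as three small checks. This is a
minimal sketch under stated assumptions: the Version class, its toy tables
and the function names are hypothetical, single letters stand in for code
points, and the canonical-ordering requirement is left out to keep the
example short.

```python
class Version:
    """Toy NamePrep version: AIO set, AI mapping (input -> output), D set."""

    def __init__(self, aio, ai_map, d):
        self.aio = set(aio)
        self.ai_map = dict(ai_map)
        self.d = set(d)

    def cls(self, cp):
        if cp in self.aio:
            return "AIO"
        if cp in self.ai_map:
            return "AI"
        if cp in self.d:
            return "D"
        return "U"  # not listed in this version, so unassigned


def registrar_accepts(v, name):
    """Registrars: only AIO code points of the latest version may be registered."""
    return all(v.cls(cp) == "AIO" for cp in name)


def client_prepare(v, name):
    """Clients: map AI, refuse D, and pass U through unmodified like AIO."""
    out = []
    for cp in name:
        c = v.cls(cp)
        if c == "D":
            raise ValueError("disallowed code point: %r" % cp)
        out.append(v.ai_map[cp] if c == "AI" else cp)
    return "".join(out)


def intermediary_accepts(v, name):
    """Intermediaries: may reject AI or D, but must not reject U."""
    return all(v.cls(cp) in ("AIO", "U") for cp in name)


if __name__ == "__main__":
    v = Version(aio="bcde", ai_map={"Q": "c"}, d="!")
    print(registrar_accepts(v, "bxde"))     # x is U here
    print(client_prepare(v, "bQxe"))        # Q is mapped, x passes through
    print(intermediary_accepts(v, "bxde"))
```

The asymmetry is the point: the registrar check is the strictest of the
three, while the client and the intermediary both let U through.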

When a character X that is in class U for the client is assigned in a
newer version of NamePrep, one of the following cases applies:

   - Character X is moved to AIO. By passing the character through as is,
     the client will end up at the correct service.
   - Character X is normalized to character Q and is therefore moved to
     AI. If the user enters character Q, he will end up at the correct
     service, but if he enters X, he will not reach any service at all,
     as X cannot exist in a registered domain name.
   - Character X is moved into D. It cannot exist in any domain name
     either, so a user entering X will not reach any service.
   - Characters XY are specified to be ordered YX. If the user enters
     YX, he will reach the correct service, but no domain name will be
     registered with the characters in the order XY, so entering XY
     will not make the user reach any service.

As we see in the list above, if the user enters class U characters that
have been assigned in a newer version of the NamePrep document, the
client either reaches the domain name that was intended or reaches no
domain at all (even though it should, because the normalization and
ordering rules have not been updated in the client).

In no case will the client reach a domain name that is registered as
a different domain from the one the client is attempting to reach.

Scenarios.

This will provide for compatibility in the following ways:

A. Suppose that a client or intermediary is on Unicode 3.1 and the site is
on Unicode 3.0. This case is simple: there will be no domains on the site
that can't be accessed by the client, since the client uses a superset of
the code points accepted by the site.

B. Suppose that a client or intermediary is on Unicode 3.0 and the site is
on Unicode 3.1. Because the client's NamePrep passes through any unassigned
character, the user can access domains on the site that use characters in
Unicode 3.1. No domains on the site can have code points that are unassigned
in 3.1, since that is illegal.

The restrictions in case B are that the client has to type the characters in
the right order, and has to use the post-mapped, post-normalized code
points.

Example1: domain is XYZW.com. Y and Z are combining marks, in canonical
order. Y is not in Unicode 3.0. If the client is on 3.1, XZYW will normalize
to XYZW.com, so the user can type either one. The 3.0 client must type
precisely XYZW.com or the name will not be found.

Example2: domain is BCDE.com. C is the normalized form of Q. A Unicode 3.1
client can type either BQDE.com or BCDE.com; both will work. The 3.0 client
can only type BCDE.com.
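
Both examples can be replayed with a toy model. This is a sketch under
assumptions, not real NamePrep: the letters are placeholders for code
points, the combining-class numbers are made up, and nameprep, reaches and
the MAP_/CCC_ tables are hypothetical names. The toy NamePrep applies the
version's mapping and then sorts each run of combining marks it knows
about; a mark it does not know is treated like an ordinary character and
passed through.

```python
# Version-specific data. In "3.0", Y and Q are unassigned and thus unknown.
MAP_30 = {}
CCC_30 = {"Z": 230}            # only Z is known to be a combining mark
MAP_31 = {"Q": "C"}            # Q normalizes to C
CCC_31 = {"Y": 220, "Z": 230}  # canonical order puts Y before Z

REGISTERED = {"XYZW", "BCDE"}  # names registered under the 3.1 rules


def nameprep(name, mapping, ccc):
    """Toy NamePrep: map, then canonically order runs of known marks."""
    out, run = [], []
    for ch in (mapping.get(c, c) for c in name):
        if ccc.get(ch, 0) > 0:
            run.append(ch)                      # known combining mark
        else:
            out.extend(sorted(run, key=ccc.get))
            run = []
            out.append(ch)                      # base or unknown character
    out.extend(sorted(run, key=ccc.get))
    return "".join(out)


def reaches(name, mapping, ccc):
    """The registered name the client ends up at, or None."""
    prepared = nameprep(name, mapping, ccc)
    return prepared if prepared in REGISTERED else None


if __name__ == "__main__":
    for typed in ("XYZW", "XZYW", "BCDE", "BQDE"):
        print(typed,
              "3.1 client:", reaches(typed, MAP_31, CCC_31),
              "3.0 client:", reaches(typed, MAP_30, CCC_30))
```

In this model the older client either reaches the same name as the newer
client or reaches nothing; it never lands on a different registered name.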

--