Re: [idn] Punycode: Upper-case in example



Martin Duerst <duerst@w3.org> wrote:

> In http://www.ietf.org/internet-drafts/draft-ietf-idn-punycode-03.txt,
> example (I) says:
> 
>  (I) Russian (Cyrillic):
>         U+043F u+043E u+0447 u+0435 u+043C u+0443 u+0436 u+0435 u+043E
>         u+043D u+0438 u+043D u+0435 u+0433 u+043E u+0432 u+043E u+0440
>         u+044F u+0442 u+043F u+043E u+0440 u+0443 u+0441 u+0441 u+043A
>         u+0438
>         Punycode: b1abfaaepdrnnbgefbaDotcwatmq2g4l
> 
> The presence of the upper-case 'D' (not to say the string 'Dot' :-)
> is confusing, because it seems completely arbitrary.  There is no
> upper-case letter in the Cyrillic string.
> 
> Is the result string actually correct?

Yes.  Uppercase and lowercase letters are equivalent in Punycode.
Section 5 says:

        41..5A (A-Z) =  0 to 25, respectively
        61..7A (a-z) =  0 to 25, respectively

    A decoder MUST recognize the letters in both uppercase and lowercase
    forms (including mixtures of both forms).
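
As a rough illustration of that rule (a sketch in Python, not the sample
C code from the spec), a digit decoder can simply fold both letter ranges
onto the same values.  The name decode_digit is only illustrative, and
the 0-9 range mapping to 26..35 comes from the same table in section 5,
which the excerpt above does not quote:

```python
def decode_digit(ch: str) -> int:
    """Map one ASCII character to its Punycode digit value (0..35)."""
    cp = ord(ch)
    if 0x41 <= cp <= 0x5A:        # A-Z -> 0..25
        return cp - 0x41
    if 0x61 <= cp <= 0x7A:        # a-z -> 0..25 (same values as A-Z)
        return cp - 0x61
    if 0x30 <= cp <= 0x39:        # 0-9 -> 26..35
        return cp - 0x30 + 26
    raise ValueError(f"not a Punycode digit: {ch!r}")


# The 'D' in example (I) and a lowercase 'd' decode to the same digit.
print(decode_digit("D"), decode_digit("d"))   # 3 3
```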

> Has it been generated mechanically?

Yes.  The sample encoder in the Punycode spec was fed the Russian text
in u+ form, exactly as it appears in example (I), and it produced the
Punycode string exactly as shown there.  Feeding that string to the
sample decoder yields the same u+ text, apart from line breaks.  If you
have an ANSI C compiler, you can reproduce these results yourself.

If you have another Punycode implementation, it won't necessarily agree
with the sample implementation on the case of the letters in the encoded
string, but that's not significant.  Any correct encoder will produce
the same ASCII string (ignoring case), and any correct decoder will
produce the same numeric code points.
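
One way to check this without a C compiler is to use an independent
implementation.  The sketch below uses the "punycode" codec from
Python's standard library, which is not the sample code from the spec.
It assumes that codec implements the same algorithm as the draft; the
names SAMPLE, EXPECTED and decode_points are made up for this example.

```python
# Example (I) from the draft: the encoded string and its code points.
SAMPLE = "b1abfaaepdrnnbgefbaDotcwatmq2g4l"
EXPECTED = [
    0x043F, 0x043E, 0x0447, 0x0435, 0x043C, 0x0443, 0x0436, 0x0435, 0x043E,
    0x043D, 0x0438, 0x043D, 0x0435, 0x0433, 0x043E, 0x0432, 0x043E, 0x0440,
    0x044F, 0x0442, 0x043F, 0x043E, 0x0440, 0x0443, 0x0441, 0x0441, 0x043A,
    0x0438,
]


def decode_points(encoded: str) -> list[int]:
    """Decode a Punycode string and return its numeric code points."""
    return [ord(c) for c in encoded.encode("ascii").decode("punycode")]


# Mixed, all-lowercase and all-uppercase input decode to the same points.
for variant in (SAMPLE, SAMPLE.lower(), SAMPLE.upper()):
    assert decode_points(variant) == EXPECTED

# Re-encoding agrees with the example once case is ignored.
text = "".join(chr(cp) for cp in EXPECTED)
reencoded = text.encode("punycode").decode("ascii")
assert reencoded.lower() == SAMPLE.lower()
```

The last comparison deliberately ignores case, since an encoder that
does not use annotations has no reason to emit the uppercase D.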

> How did the upper-case D get in there?

It corresponds to the uppercase U in the first code point of the u+
notation (U+043F; all the others are written u+).  The sample Punycode
implementation uses the case of the u as a 1-bit annotation on each code
point.  The sample encoder uses these annotations to determine the case
of the letters in the Punycode string, and the sample decoder recovers
the annotations, using them to determine the case of the u's in the u+
notation.
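
To make the decoder side of this concrete, here is a sketch in Python of
a decoder that recovers the annotations.  It is my paraphrase of the
approach in the sample C code, not a copy of it: the flag for each
non-basic code point is taken from the case of the last digit of the
delta that inserts it.  The function names are illustrative, and the
overflow checks and validation of basic code points that the C code
performs are omitted here.

```python
BASE, TMIN, TMAX, SKEW, DAMP = 36, 1, 26, 38, 700
INITIAL_BIAS, INITIAL_N = 72, 128


def decode_digit(ch):
    cp = ord(ch)
    if 0x41 <= cp <= 0x5A:
        return cp - 0x41              # A-Z -> 0..25
    if 0x61 <= cp <= 0x7A:
        return cp - 0x61              # a-z -> 0..25
    if 0x30 <= cp <= 0x39:
        return cp - 0x30 + 26         # 0-9 -> 26..35
    raise ValueError(f"bad digit {ch!r}")


def adapt(delta, numpoints, firsttime):
    """Bias adaptation after each decoded delta."""
    delta = delta // DAMP if firsttime else delta // 2
    delta += delta // numpoints
    k = 0
    while delta > ((BASE - TMIN) * TMAX) // 2:
        delta //= BASE - TMIN
        k += BASE
    return k + (BASE - TMIN + 1) * delta // (delta + SKEW)


def decode_with_flags(text):
    """Return a list of (code_point, flag) pairs for a Punycode string."""
    b = text.rfind("-")               # last delimiter, if any
    # Basic code points are copied literally; flag = letter is uppercase.
    out = [(ord(c), c.isupper()) for c in text[:b]] if b > 0 else []
    pos = b + 1 if b > 0 else 0
    n, i, bias = INITIAL_N, 0, INITIAL_BIAS
    while pos < len(text):
        oldi, w, k = i, 1, BASE
        while True:                   # decode one variable-length delta
            if pos >= len(text):
                raise ValueError("truncated input")
            ch = text[pos]
            pos += 1
            digit = decode_digit(ch)
            i += digit * w
            t = TMIN if k <= bias else TMAX if k >= bias + TMAX else k - bias
            if digit < t:
                break
            w *= BASE - t
            k += BASE
        bias = adapt(i - oldi, len(out) + 1, oldi == 0)
        n += i // (len(out) + 1)
        i %= len(out) + 1
        # Case of the last digit of the delta carries the annotation.
        out.insert(i, (n, ch.isupper()))
        i += 1
    return out


pairs = decode_with_flags("b1abfaaepdrnnbgefbaDotcwatmq2g4l")
print(" ".join(("U" if flag else "u") + f"+{cp:04X}" for cp, flag in pairs))
```

With the string from example (I), the only flagged entry should be the
first U+043F, which is why the printed notation begins with an uppercase
U and uses lowercase u everywhere else.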

Section 5 says:

    An encoder SHOULD output only uppercase forms or only lowercase
    forms, unless it uses mixed-case annotation (see appendix B).

Appendix B says:

    ...mixed-case annotation is not used by the ToASCII and ToUnicode
    operations specified in [IDNA], and therefore implementors of IDNA
    can disregard this appendix.

    These annotations do not alter the code points returned by decoders;
    the annotations are returned separately, for the caller to use or
    ignore.  Encoders can accept annotations in addition to code points,
    but the annotations do not alter the output, except to influence the
    uppercase/lowercase form of ASCII letters.

    Punycode encoders and decoders need not support these annotations,
    and higher layers need not use them.

AMC