
[idn] Change request for cidnuc



Hello,

Please consider the following suggestions for improvement to CIDNUC.

--

Rather than "wg4", I suggest the more distinctive "--" preceded by a single
letter "a" to "z".  For now only "a" to "c" are used, and the letter indicates
which form of CIDNUC encoding follows.  To leave room for future extensions,
letters "d" to "z" are reserved for potential later use.

--

CIDNUC currently provides compact encodings for Asian scripts and for scripts
like Cyrillic and Greek.  For accented Latin, however, the compression is poor.
This change request addresses that problem: with a 3-character prefix and two
base32 characters per letter, Latin labels can be up to (63-3)/2 = 30 letters
long.
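
The arithmetic behind that bound can be checked with a few lines of Python.
This is only a sketch: the helper name max_letters is hypothetical, and it
assumes the 63-octet limit on a DNS label, a 3-character prefix, and 5 bits of
payload per base32 character.

```python
# Sketch only: max_letters is a hypothetical helper, not part of CIDNUC.
LABEL_LIMIT = 63      # maximum length of a DNS label, in octets
PREFIX_LEN = 3        # one form letter plus "--"
BITS_PER_BASE32 = 5   # each base32 character carries 5 bits


def max_letters(bits_per_letter: int) -> int:
    """Number of letters that fit in one label after the prefix."""
    payload_bits = (LABEL_LIMIT - PREFIX_LEN) * BITS_PER_BASE32
    return payload_bits // bits_per_letter


print(max_letters(10))  # 10 bits per accented Latin letter; prints 30
```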

--

A label or username can be encoded in one of four ways.  Each character of the
string is taken as its two UTF-16 octets, with the notation that L10 is the
lowest 10 bits, L8 is the lowest 8 bits and H8 is the highest 8 bits:

1) if the string contains only a-z, 0-9 and hyphen, then no encoding is applied

2) else if all high octets are 0x00, 0x01, 0x02 or 0x03 (e.g. the string is
Latin-1 supplement/extended-A/etc.), then encode as follows:
 "c--" base32(L10 L10 L10 ...)

3) else if all high octets are equal (e.g. the string is Greek/Cyrillic/etc.),
then encode as follows:
 "b--" base32(H8 L8 L8 L8 ...)

4) else (e.g. Asian/etc.), encode as follows:
 "a--" base32(H8 L8 H8 L8 H8 L8 ...)

--
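
The choice between the four forms above can be sketched in Python.  This is an
illustration rather than a definitive implementation: the name choose_form is
hypothetical, every character is assumed to lie in the Basic Multilingual Plane
(one 16-bit UTF-16 unit each), and form "c" is assumed to cover high octets
0x00 to 0x03, which is everything L10 can represent and is what the d{"u}rst
example requires.

```python
import re

# Sketch only: choose_form is a hypothetical name, not part of CIDNUC.
PLAIN = re.compile(r"[a-z0-9-]+")


def choose_form(label: str) -> str:
    """Return "" (no encoding), "c", "b" or "a", testing the rules in order."""
    if not label:
        raise ValueError("empty label")
    if PLAIN.fullmatch(label):
        return ""                      # rule 1: nothing to encode
    high_octets = {ord(ch) >> 8 for ch in label}
    if any(h > 0xFF for h in high_octets):
        raise ValueError("character outside the BMP; not covered by this sketch")
    if high_octets <= {0x00, 0x01, 0x02, 0x03}:
        return "c"                     # rule 2: 10 low bits per character
    if len(high_octets) == 1:
        return "b"                     # rule 3: one shared high octet
    return "a"                         # rule 4: full 16 bits per character


print(choose_form("d\u00fcrst"))       # accented Latin
print(choose_form("\u043f\u0440\u0438"))  # Cyrillic, high octet 0x04
print(choose_form("\u65e5\u672c"))     # CJK, differing high octets
```

Read literally, this rule order sends a label made only of Greek letters (high
octet 0x03) to form "c" rather than "b", since rule 2 is tested first.

--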

                    Base32 conversion
        bits   char  hex         bits   char  hex
        00000   9    0x39        10000   p    0x70
        00001   a    0x61        10001   q    0x71
        00010   b    0x62        10010   r    0x72
        00011   c    0x63        10011   s    0x73
        00100   d    0x64        10100   t    0x74
        00101   e    0x65        10101   u    0x75
        00110   f    0x66        10110   v    0x76
        00111   g    0x67        10111   w    0x77
        01000   h    0x68        11000   x    0x78
        01001   i    0x69        11001   y    0x79
        01010   j    0x6a        11010   z    0x7a
        01011   k    0x6b        11011   2    0x32
        01100   l    0x6c        11100   3    0x33
        01101   m    0x6d        11101   4    0x34
        01110   n    0x6e        11110   5    0x35
        01111   o    0x6f        11111   6    0x36

(0 and 1 are never used.  7, 8 and - are reserved for possible future use.)
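
A minimal sketch of the table lookup in Python follows.  ALPHABET is read off
the "char" column of the table; the function name to_base32 and the
zero-padding of a short final group are assumptions, since the padding rule is
not stated here.

```python
# Sketch only: to_base32 is a hypothetical name, not part of CIDNUC.
ALPHABET = "9abcdefghijklmnopqrstuvwxyz23456"  # index = 5-bit value


def to_base32(bits: str) -> str:
    """Map a string of '0'/'1' characters to base32, five bits at a time."""
    bits += "0" * (-len(bits) % 5)  # assumed: pad the last group with zeros
    return "".join(
        ALPHABET[int(bits[i:i + 5], 2)] for i in range(0, len(bits), 5)
    )


print(to_base32("0001100100"))  # the 10 low bits of "d" (0x64); prints cd
```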

--

Example for encoding 2:

d{"u}rst@w3.org

c--cdg3crcsct@w3.org

(or it could equally be written: c--cDg3cRcScT@w3.org)


Another example for encoding 2:

www.tre-feli{^c}a.ie

www.c--ctcrceamcfceclcihica.ie
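
Both examples can be reproduced with a short Python sketch of encoding 2.  The
name encode_form_c is hypothetical; the sketch handles only the "c--" form and
assumes the caller has already checked that every code point is at most 0x3FF.

```python
# Sketch only: encode_form_c is a hypothetical name, not part of CIDNUC.
ALPHABET = "9abcdefghijklmnopqrstuvwxyz23456"  # index = 5-bit value


def encode_form_c(label: str) -> str:
    """Encode each character as its low 10 bits, i.e. two base32 characters."""
    out = ["c--"]
    for ch in label:
        low10 = ord(ch) & 0x3FF
        out.append(ALPHABET[low10 >> 5])    # upper 5 of the 10 bits
        out.append(ALPHABET[low10 & 0x1F])  # lower 5 of the 10 bits
    return "".join(out)


print(encode_form_c("d\u00fcrst"))       # u with diaeresis, U+00FC
print(encode_form_c("tre-feli\u0109a"))  # c with circumflex, U+0109
```

The two lines printed are c--cdg3crcsct and c--ctcrceamcfceclcihica, matching
the examples above.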



--

regards,
Aaron Irvine

--

-----------------------------------------------------
Aaron Irvine
  mailto:airvine@corp.phone.com
-----------------------------------------------------