Re: [idn] Using a new class for IDN



Dan Oscarsson <Dan.Oscarsson@trab.se> wrote:

> - The count of characters that can fit into 63 octets differ when
>   using ACE-names and native UCS-names.

True.  As an extreme example, consider a label consisting of many
repetitions of the same character outside plane 0.  UTF-8, UTF-16, and
UTF-32 all use 4 octets per character, while Punycode uses about 1.
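
For anyone who wants to check these figures, here is a minimal Python
sketch.  It assumes Python's built-in "punycode" codec, which produces
bare Punycode without the ACE prefix.  The character U+10400 and the
repeat count of 15 are arbitrary choices for illustration.

```python
# Illustrative sketch: octets per character for a label that repeats one
# character outside plane 0. U+10400 and the count of 15 are arbitrary.
label = "\U00010400" * 15

sizes = {
    "UTF-8": len(label.encode("utf-8")),
    "UTF-16": len(label.encode("utf-16-be")),   # -be variant adds no BOM
    "UTF-32": len(label.encode("utf-32-be")),   # -be variant adds no BOM
    "Punycode": len(label.encode("punycode")),  # bare Punycode, no "xn--"
}

for name, octets in sizes.items():
    print(f"{name:9s}{octets:4d} octets, {octets / len(label):.2f} per character")
```

Only the first character costs Punycode several octets; each repetition
after that should add a single octet.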

As an extreme example the other way, consider a label consisting of
random characters from plane 0.  UTF-16 uses 2 octets per character,
while Punycode uses about 3.5.
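
The opposite case can be sketched the same way.  The stride pattern
below is my own deterministic stand-in for "random" plane-0 characters,
chosen so the result is reproducible; it again assumes Python's built-in
"punycode" codec, and the exact Punycode cost will vary with the
characters picked.

```python
# Illustrative sketch: a label of widely scattered plane-0 characters.
# The stride pattern is a deterministic stand-in for random characters;
# every code point stays below the surrogate range (U+D800..U+DFFF).
n = 12
code_points = [0x1000 + ((i * 5) % n) * 0x1100 for i in range(n)]
label = "".join(chr(cp) for cp in code_points)

utf16_octets = len(label.encode("utf-16-be"))    # -be variant adds no BOM
punycode_octets = len(label.encode("punycode"))  # bare Punycode, no "xn--"

print(f"UTF-16:   {utf16_octets / n:.2f} octets per character")
print(f"Punycode: {punycode_octets / n:.2f} octets per character")
```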

> To make things easier for the future, IDNA should require that the IDN
> in the ToUnicode form must not be longer than 63 octets.

ToUnicode does not output octets, it outputs code points.  Which
encoding form did you have in mind, UTF-8, UTF-16, or UTF-32?

UTF-32 is always at least as large as UTF-16, sometimes larger, so I'll
assume you don't want that one.

If you go with UTF-16, then all existing ASCII labels over 31 characters
become retroactively invalid, which seems very bad.
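
The arithmetic behind that limit, as a small Python sketch (the helper
name is hypothetical, and the 63-octet limit is the one under
discussion):

```python
# Illustrative sketch: how many ASCII characters fit in a 63-octet label
# under each encoding form. The helper name is hypothetical.
MAX_LABEL_OCTETS = 63

def octets_per_ascii_char(encoding: str) -> int:
    """Return the number of octets one ASCII character occupies."""
    return len("a".encode(encoding))

capacity = {}
for name, encoding in [("UTF-8", "utf-8"),
                       ("UTF-16", "utf-16-be"),   # -be variant adds no BOM
                       ("UTF-32", "utf-32-be")]:
    capacity[name] = MAX_LABEL_OCTETS // octets_per_ascii_char(encoding)
    print(f"{name}: at most {capacity[name]} ASCII characters")

# A 32-character ASCII label is legal today but needs 64 octets in UTF-16.
label = "a" * 32
print(len(label.encode("ascii")), len(label.encode("utf-16-be")))
```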

If you go with UTF-8, then Indian scripts can fit only 21 characters per
label, versus about 40 for ACE.  It seems a shame to halve the limit
for a billion users.  I'd really rather not.
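
A rough way to compare the two limits, sketched in Python.  The sample
label is an arbitrary repetition of three Devanagari letters rather than
a real name, the helper names are hypothetical, and the sketch skips
nameprep and simply prepends the "xn--" prefix to the output of Python's
built-in "punycode" codec.  The ACE size depends on the actual
characters, so treat the printed numbers as one data point.

```python
# Illustrative sketch: cost of a Devanagari label in UTF-8 versus ACE.
# The sample text (U+0915 U+092E U+0932 repeated) is arbitrary, and
# nameprep is skipped, so this only approximates ToASCII.
MAX_LABEL_OCTETS = 63
ACE_PREFIX = "xn--"

def utf8_octets(label: str) -> int:
    """Octets needed to carry the label as UTF-8."""
    return len(label.encode("utf-8"))

def ace_octets(label: str) -> int:
    """Octets needed for the ACE prefix plus bare Punycode."""
    return len(ACE_PREFIX) + len(label.encode("punycode"))

sample = "\u0915\u092e\u0932" * 14  # 42 characters to slice from

for n in (21, 22, 40):
    label = sample[:n]
    print(f"{n} characters: UTF-8 {utf8_octets(label)} octets, "
          f"ACE {ace_octets(label)} octets (limit {MAX_LABEL_OCTETS})")
```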

AMC