Re: [idn] Using a new class for IDN

To: idn@ops.ietf.org
Subject: Re: [idn] Using a new class for IDN
From: "Adam M. Costello" <idn.amc+0@nicemice.net.RemoveThisWord>
Date: Sun, 2 Jun 2002 21:34:19 +0000
In-reply-to: <3CF9DE37.1A360FF9@trab.se>
References: <3CF9DE37.1A360FF9@trab.se>
Reply-to: IETF idn working group <idn@ops.ietf.org>
User-agent: Mutt/1.3.28i

Dan Oscarsson <Dan.Oscarsson@trab.se> wrote:

> - The count of characters that can fit into 63 octets differ when
>   using ACE-names and native UCS-names.

True.  As an extreme example, consider a label consisting of many
repetitions of the same character outside plane 0.  UTF-8, UTF-16, and
UTF-32 all use 4 octets per character, while Punycode uses about 1.

As an extreme example the other way, consider a label consisting of
random characters from plane 0.  UTF-16 uses 2 octets per character,
while Punycode uses about 3.5.
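
The two extremes above can be checked roughly with a short script.  This
is only a sketch, not part of the original message: it assumes Python's
built-in "punycode" codec (the bare encoding, without any ACE prefix),
and the helper name octets_per_char, the choice of U+10400, the label
length of 20, and the fixed random seed are all arbitrary picks for
illustration.  The exact Punycode figures will vary with the label.

```python
import random

def octets_per_char(label):
    """Average encoded octets per code point for several encodings."""
    n = len(label)  # Python 3 strings count code points, not octets
    return {
        # The "-le" variants are used so that no byte-order mark is counted.
        "utf-8": len(label.encode("utf-8")) / n,
        "utf-16": len(label.encode("utf-16-le")) / n,
        "utf-32": len(label.encode("utf-32-le")) / n,
        "punycode": len(label.encode("punycode")) / n,
    }

# Extreme 1: the same character outside plane 0, repeated many times.
repeated = "\U00010400" * 20

# Extreme 2: scattered plane 0 characters (non-ASCII, surrogates excluded).
rng = random.Random(0)  # fixed seed so that runs are repeatable
bmp = [cp for cp in range(0x80, 0x10000) if not 0xD800 <= cp <= 0xDFFF]
scattered = "".join(chr(cp) for cp in rng.sample(bmp, 20))

for name, label in (("repeated", repeated), ("scattered", scattered)):
    print(name, octets_per_char(label))
```

For the repeated label, all three UTF forms should report 4 octets per
character while Punycode stays close to 1.  For the scattered label,
UTF-16 reports 2 and Punycode comes out noticeably higher.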

> To make things easier for the future, IDNA should require that the IDN
> in the ToUnicode form must not be longer than 63 octets.

ToUnicode does not output octets; it outputs code points.  Which
encoding form did you have in mind: UTF-8, UTF-16, or UTF-32?

UTF-32 is always at least as large as UTF-16, sometimes larger, so I'll
assume you don't want that one.

If you go with UTF-16, then all existing ASCII labels over 31 characters
become retroactively invalid, which seems very bad.

If you go with UTF-8, then Indian scripts can fit only 21 characters per
label, versus about 40 for ACE.  It seems a shame to halve the limit
for a billion users.  I'd really rather not.
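
The 31 and 21 figures follow from dividing the 63-octet label limit by
the octets each character needs.  A minimal sketch of that arithmetic,
not part of the original message: the helper names octets and
longest_fit are made up for illustration, and DEVANAGARI LETTER KA
(U+0915) stands in for "Indian scripts" as one example of a character
that takes 3 octets in UTF-8.

```python
LIMIT = 63  # maximum number of octets in a DNS label

def octets(label, encoding):
    """Number of octets the label occupies in the given encoding."""
    return len(label.encode(encoding))

def longest_fit(char, encoding, limit=LIMIT):
    """Longest run of `char` whose encoded form fits within `limit` octets."""
    n = 0
    while octets(char * (n + 1), encoding) <= limit:
        n += 1
    return n

# ASCII letters take 2 octets each in UTF-16 (no byte-order mark counted).
print(longest_fit("a", "utf-16-le"))   # 31

# DEVANAGARI LETTER KA takes 3 octets in UTF-8.
print(longest_fit("\u0915", "utf-8"))  # 21
```

The ACE figure of about 40 is not reproduced here, since it depends on
the actual text of the label rather than on a fixed per-character cost.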

AMC

References:
- [idn] Using a new class for IDN
  - From: Dan Oscarsson <Dan.Oscarsson@trab.se>
