
Re: [idn] URL encoding in html page



David Leung stated:

> > > Agreed, the design of how UTF-8 encodes and decodes can be expanded to
> > > support more bits... so that's why UTF-8 should be reasonable as the
> > > long-term solution for designing i18n applications, including IDN...

and Doug Ewell responded:

> > Please, everybody, read the definition of UTF-8 before making these
> > claims that it supports an encoding structure wider than 31 bits.  (And
> > please read the statements of WG2 and Unicode concerning expansion
> > beyond U+10FFFD.)

and then David Leung retorted:

> The word "EXPAND" here means it can be extended or modified to cope with
> future requirements beyond the current limits... I didn't say that it is
> "READY" to support more... anyone who reads carefully should see the words
> "can be expanded" and not "can be able" : )

I'm presuming that David is referring to the fact that the byte values
0xFE and 0xFF are unused in UTF-8. Currently the longest UTF-8 forms
defined in 10646 are 6-byte sequences, and the maximum value is
FD BF BF BF BF BF, corresponding to the code point U-7FFFFFFF. If you
took the same bit-slicing scheme and used 0xFE and 0xFF as introducers for
7-byte sequences, you could conceivably extend the UTF-8 scheme from
representing 31-bit code points to representing 37-bit code points, i.e.
covering a code space 0..1FFFFFFFFF, or 128 gig of code points, instead
of the puny 2 gig of code points theoretically available to 10646 now.
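
For the curious, here is a sketch of that bit-slicing in the historical
1- to 6-byte form (the RFC 2279 / current 10646 definition). The Python
below is purely illustrative -- the function name and the final check are
mine -- but it does reproduce the FD BF BF BF BF BF form for U-7FFFFFFF:

    def utf8_legacy_encode(cp):
        """Encode a code point with the historical 1..6-byte UTF-8 scheme."""
        if cp < 0x80:
            return bytes([cp])                    # 1 byte: 0xxxxxxx
        # (sequence length, lead-byte marker, payload bits in the lead byte)
        for nbytes, lead, bits in ((2, 0xC0, 5), (3, 0xE0, 4), (4, 0xF0, 3),
                                   (5, 0xF8, 2), (6, 0xFC, 1)):
            if cp < (1 << (bits + 6 * (nbytes - 1))):
                trail = []
                for _ in range(nbytes - 1):
                    trail.append(0x80 | (cp & 0x3F))   # trailing 10xxxxxx bytes
                    cp >>= 6
                return bytes([lead | cp] + trail[::-1])
        raise ValueError("code point exceeds 31 bits")

    # The largest value the 6-byte form can carry:
    assert utf8_legacy_encode(0x7FFFFFFF) == bytes.fromhex("FD BF BF BF BF BF")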

Well, to paraphrase a former president, yes, we *could* do that, ... but
it would be *wrong*!

The problem is that envisioning this kind of expansion misconstrues
the problem.

The *real* problem is guaranteeing interoperability for UTF-8, UTF-16,
and UTF-32, which are the three sanctioned encoding forms of Unicode --
all of them seeing real use in widespread applications. To ensure this,
the UTC and WG2 are busy *constraining* the code space -- not worrying
about opening it up to multiple gigabytes of numbers because someone
is concerned that someday aliens may land. The *useful* code space
is 0..10FFFF, the 17 planes of 10646 accessible via UTF-16.
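
To see where that bound comes from, recall that UTF-16 reaches beyond the
BMP only via surrogate pairs, and a pair carries exactly 10 + 10 = 20
payload bits -- that is, exactly 16 supplementary planes on top of the BMP,
for 17 planes in all. A tiny illustrative sketch (the function is mine,
nothing normative):

    def utf16_encode(cp):
        """Encode a code point as one or two UTF-16 code units."""
        if cp <= 0xFFFF:
            return [cp]                          # BMP: a single 16-bit unit
        if cp > 0x10FFFF:
            raise ValueError("not reachable via a surrogate pair")
        v = cp - 0x10000                         # a 20-bit value
        return [0xD800 | (v >> 10),              # high surrogate
                0xDC00 | (v & 0x3FF)]            # low surrogate

    # 65,536 (BMP) + 1024 * 1024 (pairs) = 1,114,112 code points in all
    assert utf16_encode(0x10FFFF) == [0xDBFF, 0xDFFF]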

The recurrent urban folk legends that Unicode doesn't have enough
code points (1,114,112 to be exact), or that UTF-16 is insufficient,
or that UTF-8 may have to be expanded, and so on, are just that: urban
folk legends. They are driven by people overly impressed by the sheer
number of CJK characters, who fail to do their research on the public
Unicode roadmap pages:

http://www.unicode.org/roadmaps/

and are apparently coupled with a tendency to profound innumeracy.
There are still wide-open spaces in Plane 1 of the Roadmap, even
after nearly every historic script known to man is laid out with
tentative allocations. Even if CJK spills out of Plane 2 and starts
filling up Plane 3, that still leaves 10 planes (655,340 code points,
not counting 20 noncharacters) totally unaccounted for, with no
viable candidate characters anywhere to even start filling them.

And except for the occasional burps of CJK characters, the relevant
character encoding committees, the UTC and WG2, are managing to
process approximately 1000 characters per year for new standardization
as part of 10646 and the Unicode Standard. You do the math and tell
me at that rate how long it will take to standardize over 700,000
characters. The Internet and the IETF will be forgotten ancient
footnotes long before anybody manages to standardize 700,000 more
characters.
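
In case anyone does want the math spelled out, the figures above reduce to
a couple of lines (nothing here beyond the numbers already quoted in this
message):

    # 10 empty planes, minus the 2 noncharacters reserved in each plane:
    print(10 * 0x10000 - 10 * 2)      # 655,340 open code points

    # ~1,000 newly standardized characters per year vs. 700,000+ open slots:
    print(700_000 // 1_000)           # 700 years at the current pace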

--Ken