
Re: [idn] URL encoding in html page



I am sorry I brought this up.

I brought up the point about 128-bit-clean applications to illustrate my
point that one cannot accurately know what "long term" means. Maybe 128 bits
is not needed for encoding; maybe it is needed for IPv6, or for something I
don't know about.

It is difficult to do standardization for something that is coming in the
"long term" because no one really knows what the future looks like. We can
only do standardization based on the things we know now. This is a
"backward"-looking view of standardization.

Another group of people advocates that standardization should be
"forward"-looking, driving the adoption of new technologies. This is also
important.

The conflict between these two groups has always existed in this WG and
others in the IETF: one claims the other is impractical, the other claims
the first is too conservative. The difficulty is balancing the "backward"-
and "forward"-looking views. Doug hit the nail on the head when he asked
"How far into the future?"

(And now try imagining you are sitting between these two groups of people,
trying to gauge rough consensus of the whole group :-)

Anyway, let me conclude by saying that it was my fault to bring up 128 bits.
Let's drop this and move back to our other discussions, such as IRI and Host:,
copy & paste, etc.

-James Seng

----- Original Message -----
From: "Kenneth Whistler" <kenw@sybase.com>
To: <david@neteka.com>
Cc: <idn@ops.ietf.org>; <kenw@sybase.com>
Sent: Tuesday, April 02, 2002 7:47 AM
Subject: Re: [idn] URL encoding in html page


> David Leung stated:
>
> > > > Agrees, the design of how UTF-8 encodes and decodes can be expanded
> > > > to support more bits... so that's why UTF8 should be reasonable to
> > > > be the long term solution for designing i18n applications, including
> > > > IDN...
>
> and Doug Ewell responded:
>
> > > Please, everybody, read the definition of UTF-8 before making these
> > > claims that it supports an encoding structure wider than 31 bits.
> > > (And please read the statements of WG2 and Unicode concerning
> > > expansion beyond U+10FFFD.)
>
> and then David Leung retorted:
>
> > The word "EXPAND" here means it can be extended or modify to cope with
> > future requirements over the limits now... i didn't say that it is
"READY"
> > to support more... any one that reads carefully show see the words "can
be
> > expanded" and not "can be able" : )
>
> I'm presuming that David is referring to the fact that the byte values
> 0xFE and 0xFF are unused in UTF-8. Currently the longest UTF-8 forms
> defined in 10646 are 6-byte sequences, and the maximum value is
> FD BF BF BF BF BF, corresponding to the code point U-7FFFFFFF. If you
> took the same bit-slicing scheme and used 0xFE and 0xFF as introducers for
> 7-byte sequences, you could conceivably extend the UTF-8 scheme from
> representing 31-bit code points to representing 37-bit code points, i.e.
> covering a code space 0..1FFFFFFFFF, or 128 gig of code points, instead
> of the puny 2 gig of code points theoretically available to 10646 now.
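
For concreteness, here is a minimal Python sketch of the bit-slicing Ken
describes above: the original (pre-RFC 3629) 1- to 6-byte UTF-8 forms, plus
the hypothetical 7-byte form using 0xFE/0xFF as introducers. The function
name and range checks are this sketch's own; the 7-byte form is not part of
10646 or Unicode and is shown only to illustrate the arithmetic.

    # Sketch of the original (pre-RFC 3629) UTF-8 bit-slicing scheme,
    # extended hypothetically to 7-byte sequences (0xFE/0xFF lead bytes).
    # The 7-byte form is NOT standard; it only illustrates the 37-bit math.
    def encode_utf8_extended(cp: int) -> bytes:
        if cp < 0:
            raise ValueError("negative code point")
        if cp <= 0x7F:                      # 1 byte:  7 payload bits
            return bytes([cp])
        if cp <= 0x7FF:                     # 2 bytes: 11 payload bits
            return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
        if cp <= 0xFFFF:                    # 3 bytes: 16 payload bits
            lead, n = 0xE0, 2
        elif cp <= 0x1FFFFF:                # 4 bytes: 21 payload bits
            lead, n = 0xF0, 3
        elif cp <= 0x3FFFFFF:               # 5 bytes: 26 payload bits
            lead, n = 0xF8, 4
        elif cp <= 0x7FFFFFFF:              # 6 bytes: 31 bits (10646 limit)
            lead, n = 0xFC, 5
        elif cp <= 0x1FFFFFFFFF:            # 7 bytes: 37 bits (hypothetical)
            lead, n = 0xFE, 6
        else:
            raise ValueError("code point out of range")
        out = [lead | (cp >> (6 * n))]      # lead byte carries the top bits
        for i in range(n - 1, -1, -1):      # then n continuation bytes
            out.append(0x80 | ((cp >> (6 * i)) & 0x3F))
        return bytes(out)

    # U-7FFFFFFF -> FD BF BF BF BF BF, the longest form defined in 10646.
    assert encode_utf8_extended(0x7FFFFFFF) == bytes(
        [0xFD, 0xBF, 0xBF, 0xBF, 0xBF, 0xBF])
    # The hypothetical maximum, U-1FFFFFFFFF, would use the 0xFF introducer.
    assert encode_utf8_extended(0x1FFFFFFFFF)[0] == 0xFF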
>
> Well, to paraphrase a former president, yes, we *could* do that, ... but
> it would be *wrong*!
>
> The problem is that envisioning this kind of expansion misconstrues
> the problem.
>
> The *real* problem is guaranteeing interoperability for UTF-8, UTF-16,
> and UTF-32, which are the three sanctioned encoding forms of Unicode --
> all of them seeing real use in widespread applications. To ensure this,
> the UTC and WG2 are busy *constraining* the code space -- not worrying
> about opening it up to multiple gigabytes of numbers because someone
> is concerned that someday aliens may land. The *useful* code space
> is 0..10FFFF, the 17 planes of 10646 accessible via UTF-16.
>
> The recurrent urban folk legends that Unicode doesn't have enough
> code points (1,114,112 to be exact), or that UTF-16 is insufficient,
> or that UTF-8 may have to be expanded, and so on, are just that: urban
> folk legends. They are driven by people overly impressed by the CJK
> character numbers, failing to do their research on the public Unicode
> roadmap pages:
>
> http://www.unicode.org/roadmaps/
>
> and are apparently coupled with a tendency to profound innumeracy.
> There are still wide-open spaces in Plane 1 of the Roadmap, even
> after nearly every historic script known to man is laid out with
> tentative allocations. Even if CJK spills over Plane 2 and starts
> filling up Plane 3, that still leaves 10 planes (655,340 code points,
> not counting 20 noncharacters) totally unaccounted for, with no
> viable candidate characters anywhere to even start filling them.
>
> And except for the occasional burps of CJK characters, the relevant
> character encoding committees, the UTC and WG2, are managing to
> process approximately 1000 characters per year for new standardization
> as part of 10646 and the Unicode Standard. You do the math and tell
> me at that rate how long it will take to standardize over 700,000
> characters. The Internet and the IETF will be forgotten ancient
> footnotes long before anybody manages to standardize 700,000 more
> characters.
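
(Working out the arithmetic Ken invites: at roughly 1,000 newly standardized
characters per year, 700,000 code points works out to about 700 years.)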
>
> --Ken
>