
Re: OT - Re: [idn] URL encoding in html page

To: "Soobok Lee" <lsb@postel.co.kr>,"IETF idn working group" <idn@ops.ietf.org>
Subject: Re: OT - Re: [idn] URL encoding in html page
From: "Mark Davis" <mark@macchiato.com>
Date: Mon, 25 Mar 2002 07:20:34 -0800
References: <OF983EF17D.A231C55F-ON85256B83.007CF373@incentivesystems.com> <5.1.0.14.2.20020321180947.02fd5b00@127.0.0.1> <20020322032033.GC940@spsoft.co.kr> <20020322042622.GA1853@spsoft.co.kr> <038101c1d15c$972f9940$a303a8c0@bruce> <09df01c1d172$b01508d0$2b19fea9@temp> <039b01c1d179$a25b6070$a303a8c0@bruce> <0a3d01c1d17c$6523cf00$2b19fea9@temp> <03b201c1d184$09260760$a303a8c0@bruce> <0a9f01c1d189$00baa040$2b19fea9@temp> <009c01c1d1b4$c6a2efd0$8900a8c0@c1340594a> <0c0001c1d1bc$feb9f0a0$2b19fea9@temp> <018001c1d1ce$550602d0$8900a8c0@c1340594a> <0d7a01c1d202$fd52bcb0$2b19fea9@temp>
Reply-to: "Mark Davis" <mark@macchiato.com>

It is of course possible that some areas do not accept it, much as the
United States has not accepted the metric system (except for
scientific work, and the important realm of soft-drink bottles). It is
difficult to predict the speed of adoption of any technology, but I
suspect you will be surprised at the situation in 5 or 10 years.

Mark
—————

Γνῶθι σαυτόν — Θαλῆς
[For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]

http://www.macchiato.com

----- Original Message -----
From: "Soobok Lee" <lsb@postel.co.kr>
To: "Mark Davis" <mark@macchiato.com>; "IETF idn working group"
<idn@ops.ietf.org>
Sent: Friday, March 22, 2002 16:38
Subject: Re: OT - Re: [idn] URL encoding in html page


> Dear Mark,
>
> UNICODE will become more and more popular as time goes by.
> But that does not mean that legacy encodings will disappear or be
> made obsolete by UTF-8.
> There are at least two reasons why legacy encodings will last forever.
>
>  1. Most legacy codes are standardized by local *governments*, which
>     are best qualified to find and reflect their local communities'
>     character needs.
>     For example, the Korean government has been constantly revising
>     its KSX100? local legacy codes to include new graphic characters
>     and rarely used Chinese characters, even before UNICODE decided
>     to include them.
>     In other words, legacy codes are under the control of their
>     language communities. UNICODE is not; it has its own schedules,
>     principles, and motivations.
>     It may be *politically* impossible for legacy codes to be made
>     obsolete by UNICODE.
>     Can we imagine the Korean government publishing its laws and
>     regulations in UNICODE rather than in KSX100x?
>
>  2. Legacy encodings are already internationally interoperable in
>     popular HTML/MIME content.
>     There is no reason why owners of KSX100x-encoded homepages, or
>     senders of such messages, should abandon legacy encodings and
>     move to UTF-8 at the cost of additional space and operational
>     inefficiency, now or in the foreseeable future.
>
>  I believe UNICODE is everywhere now and will remain everywhere in
>  the future. At the same time, UNICODE has given legacy
>  encodings/codes more opportunities to be interoperable at minimal
>  cost.
>
>  Soobok Lee
>
>   ----- Original Message -----
>   From: Mark Davis
>   To: Soobok Lee ; IETF idn working group
>   Sent: Saturday, March 23, 2002 3:21 AM
>   Subject: OT - Re: [idn] URL encoding in html page
>
>
>   From my experience talking with customers in the field, the main
>   reason that people are not serving up UTF-8 pages is not the
>   bandwidth; it is the fact that there are still some browsers out in
>   the field that do not yet handle it correctly. While they are
>   dying off fairly quickly, it is not quite at the point where people
>   are willing to write them off.
>
>   As far as size goes, it is worthwhile looking at some data
>   samples. The following are from a page on the Unicode site that is
>   translated into different languages, so it has essentially the same
>   information on each page.
>
>      Size  Page
>      8882  s-chinese.html
>      8946  t-chinese.html
>      9347  esperanto.html
>      9498  maltese.html
>      9739  icelandic.html
>      9833  czech.html
>      9944  welsh.html
>     10064  danish.html
>     10109  swedish.html
>     10127  polish.html
>     10219  interlingua.html
>     10221  italian.html
>     10297  spanish.html
>     10308  portuguese.html
>     10312  lithuanian.html
>     10329  german.html
>     10376  romanian.html
>     10401  korean.html
>     10506  french.html
>     10726  japanese.html
>     10953  hebrew.html
>     11192  arabic.html
>     13292  greek.html
>     13870  russian.html
>     13892  persian.html
>     14549  hindi.html
>     15337  georgian.html
>     15853  deseret.html
>
>   So the best case is roughly 56% of the worst case. Some of this is
>   due to the encoding, and some is due to different languages just
>   using different numbers of characters. However, when you look at web
>   pages in general use, the amount of text (in bytes) is really
>   swamped by graphics, Javascript, HTML code, and so on. So
>   fundamentally, even the variations above are not that important in
>   practice.
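
The best-case/worst-case ratio can be checked from the smallest and largest entries in the table. This is only a quick arithmetic sketch; the variable names are illustrative and not part of the original message.

```python
# Smallest and largest page sizes taken from the table above.
smallest = 8882    # s-chinese.html
largest = 15853    # deseret.html

# Fraction of the worst case that the best case occupies.
ratio = smallest / largest
print(f"best case is {ratio:.0%} of the worst case")
```
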
>
>   BTW This is getting way off topic.
>
>   Mark
>   —————
>
>   Γνῶθι σαυτόν — Θαλῆς
>   [For transliteration, see
http://oss.software.ibm.com/cgi-bin/icu/tr]
>
>   http://www.macchiato.com
>
>   ----- Original Message -----
>   From: "Soobok Lee" <lsb@postel.co.kr>
>   To: "Mark Davis" <mark@macchiato.com>; "IETF idn working group"
<idn@ops.ietf.org>
>   Sent: Friday, March 22, 2002 08:16
>   Subject: Re: [idn] URL encoding in html page
>
>
>   >
>   > ----- Original Message -----
>   > From: "Mark Davis" <mark@macchiato.com>
>   > To: "Soobok Lee" <lsb@postel.co.kr>; "IETF idn working group"
<idn@ops.ietf.org>
>   > Sent: Saturday, March 23, 2002 12:18 AM
>   > Subject: Re: [idn] URL encoding in html page
>   >
>   >
>   > > Compliant browsers already have to handle Unicode, since NCRs
>   > > (e.g. &#x1234; ) are always Unicode code points. All XML parsers
>   > > also have to handle Unicode (UTF-8 and UTF-16).
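
To illustrate the point about NCRs, here is a minimal sketch using Python's standard `html.unescape`. It shows that a numeric character reference resolves to a Unicode code point, whatever byte encoding the page itself uses. The choice of Python is an assumption for illustration and has nothing to do with how any browser implements this.

```python
import html

# "&#x1234;" is a hexadecimal numeric character reference (NCR).
ncr = "&#x1234;"
char = html.unescape(ncr)

# The NCR names code point U+1234, independent of the page's charset.
print(f"U+{ord(char):04X}")

# The same code point serializes differently in each Unicode encoding form.
print(char.encode("utf-8").hex())      # UTF-8 bytes
print(char.encode("utf-16-be").hex())  # UTF-16 (big-endian) bytes
```
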
>   >
>   > Right, already.
>   > MS IE and Netscape have supported UNICODE for several years,
>   > but most homepages are still in legacy encodings.
>   > MS Word (already Unicode-based) has features to produce
>   > legacy-encoded .html files (from Unicode-based .doc files) for
>   > web publishing.
>   >
>   > Korean/Japanese/Chinese texts in UTF-8 are 50% bigger than legacy
>   > ones. 50% more disk space and bandwidth will be required.
>   > Each Cyrillic letter occupies one octet in a legacy code, while in
>   > UTF-8 it requires 2 octets, so 100% more space is needed.
>   > I cannot imagine all Russians making the transition to UTF-8.
>   > Legacy encodings are more space-efficient than UNICODE.
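
The per-character overhead discussed here can be measured directly. The sketch below compares encoded lengths of two short sample strings; the sample text and the choice of `euc_kr` and `koi8_r` as the legacy codecs are assumptions made for illustration. Hangul syllables take two octets in EUC-KR and three in UTF-8 (50% more), and each Cyrillic letter takes one octet in a single-byte legacy code and two in UTF-8 (100% more).

```python
# Sample strings; the text and the legacy codecs chosen are illustrative.
samples = {
    "Korean": ("한국어", "euc_kr"),
    "Russian": ("Привет", "koi8_r"),
}

def overhead(text: str, legacy: str) -> float:
    """Return how much larger the UTF-8 form is than the legacy form (0.5 = 50%)."""
    legacy_len = len(text.encode(legacy))
    utf8_len = len(text.encode("utf-8"))
    return utf8_len / legacy_len - 1

for name, (text, codec) in samples.items():
    print(f"{name}: UTF-8 is {overhead(text, codec):.0%} larger than {codec}")
```

These figures apply to pure text only; markup and other ASCII content in a page encode to the same size either way.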
>   >
>   > Legacy-to-legacy conversions like BIG5->KSX1001 are in practice
>   > implemented as two steps: BIG5->UNICODE and UNICODE->KSX1001.
>   > UNICODE is actively used as such an intermediate encoding, but is
>   > still not used or entered directly by end users so actively.
>   > Rather, UNICODE may be a hub that facilitates the interchange of
>   > information in different legacy encodings, or font sharing for
>   > differently legacy-encoded characters.
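
The two-step conversion described above can be sketched as follows. `convert_via_unicode` is a hypothetical helper name, and Python's `big5` and `euc_kr` codecs stand in for the BIG5 and KSX1001 tables mentioned in the message. The sample text is assumed to exist in both repertoires.

```python
def convert_via_unicode(data: bytes, src: str, dst: str) -> bytes:
    """Convert between two legacy encodings using Unicode as the hub."""
    text = data.decode(src)    # step 1: legacy -> Unicode
    return text.encode(dst)    # step 2: Unicode -> legacy

# Two Chinese characters that are also present as hanja in the Korean set.
big5_bytes = "中文".encode("big5")
euckr_bytes = convert_via_unicode(big5_bytes, "big5", "euc_kr")
print(euckr_bytes.decode("euc_kr"))
```

A character missing from the target repertoire would raise `UnicodeEncodeError` in step 2, which is where the versioning concerns raised elsewhere in this thread would surface.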
>   >
>   > I regard UNICODE as a substrate (not a competitor) upon which
>   > legacy encodings are built.
>   >
>   > >
>   > > > Legacy encodings
>   > > > will dominate even in the future, because they are compact and
>   > > > inexpensive.
>   > >
>   > > While I do expect the transition to Unicode to take some time,
>   > > once some of the older browsers die off it may shift more rapidly
>   > > than we think.
>   >
>   > I am neither a UNICODE expert nor a character expert. But every
>   > day I feel the strong inertia toward legacy encodings in our local
>   > language communities.
>   > Text formats that support language tagging, like HTML, will greatly
>   > lengthen the lifespan of legacy encodings and allow legacy-encoded
>   > HTML texts to be interchanged internationally without problems.
>   >
>   > Soobok Lee
>   >
>   > >
>   > > Mark
>   > > —————
>   > >
>   > > Γνῶθι σαυτόν — Θαλῆς
>   > > [For transliteration, see
http://oss.software.ibm.com/cgi-bin/icu/tr]
>   > >
>   > > http://www.macchiato.com
>   > >
>   > > ----- Original Message -----
>   > > From: "Soobok Lee" <lsb@postel.co.kr>
>   > > To: "IETF idn working group" <idn@ops.ietf.org>
>   > > Sent: Friday, March 22, 2002 02:04
>   > > Subject: Re: [idn] URL encoding in html page
>   > >
>   > >
>   > > >
>   > > > ----- Original Message -----
>   > > > From: "Bruce Thomson" <bthomson@fm-net.ne.jp>
>   > > > To: "Soobok Lee" <lsb@postel.co.kr>; "IETF idn working
group"
>   > > <idn@ops.ietf.org>
>   > > > Sent: Friday, March 22, 2002 6:29 PM
>   > > > Subject: Re: [idn] URL encoding in html page
>   > > >
>   > > >
>   > > > > > What if all the viewable HTML text is in English, but only
>   > > > > > the href URL contains legacy (Korean) encoded hostnames?
>   > > > > > Chinese visitors would see a clean English homepage, but
>   > > > > > fail to click through the Korean link.
>   > > > > >
>   > > > > Well, that could happen, but a META tag would solve that so
>   > > > > easily. Personally I often use a simple text editor to deal
>   > > > > with HTML, and would find it easier to use legacy encodings
>   > > > > or UTF-8 than cut-and-paste ACE from somewhere.
>   > > > > Of course the user could do it either way and it would work.
>   > > >
>   > > > Yes. Charset META tags help. But many homepages make
>   > > > assumptions about their main audience's default character
>   > > > encoding and very often omit the META tag for the encoding,
>   > > > such as:
>   > > >   <meta http-equiv="Content-Type" content="text/html; charset=euc-kr">
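
A minimal sketch of why the missing META tag matters. It assumes a simplified regex scan rather than a real HTML parser, and `sniff_charset` and its `default` parameter are hypothetical names. When the page declares no charset, the viewer falls back to its own local default, which may differ from the author's.

```python
import re

# Simplified pattern: finds charset=... near the start of the page.
CHARSET_RE = re.compile(rb'charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE)

def sniff_charset(raw: bytes, default: str) -> str:
    """Return the declared charset, or the viewer's default if none is declared."""
    match = CHARSET_RE.search(raw[:1024])
    return match.group(1).decode("ascii").lower() if match else default

with_meta = b'<meta http-equiv="Content-Type" content="text/html; charset=euc-kr">'
without_meta = b"<html><body>hello</body></html>"

# A viewer whose local default is Big5 (an assumed example locale).
print(sniff_charset(with_meta, default="big5"))     # declared charset wins
print(sniff_charset(without_meta, default="big5"))  # falls back to the viewer's default
```
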
>   > > >
>   > > > Moreover, an IDN URL could be used in a pure FRAMESET document
>   > > > that defines frame URLs and contains no viewable text. Such
>   > > > FRAMESET documents often omit charset META tags.
>   > > >  (look into the html source of http://www.freeway.co.kr/ )
>   > > >
>   > > > AFAIK, 99.99999% of Korean homepages have an implicit/explicit
>   > > > legacy Korean encoding (KS_C_5601-1987 or euc-kr). So do most
>   > > > Japanese/Chinese homepages.
>   > > > UTF-8/UCS-2 encodings are rarely used in global web publishing.
>   > > > Legacy encodings will dominate even in the future, because they
>   > > > are compact and inexpensive.
>   > > >
>   > > > If we want to make IDN truly internationally interoperable, all
>   > > > IDN-aware web browsers/applications should contain libraries of
>   > > > all kinds of legacy-to-Unicode conversion routines. That would
>   > > > place too heavy a memory load on handheld devices like PDAs.
>   > > >
>   > > > Moreover, legacy encodings are revised separately from Unicode.
>   > > > We may face versioning problems as tough as the
>   > > > stringprep/nameprep versioning problems for newly added Unicode
>   > > > code points.
>   > > > How can we guarantee the stability and integrity of IDN
>   > > > operations across all combinations of the numerous kinds and
>   > > > versions of IDN-aware applications and legacy encodings?
>   > > >
>   > > > Soobok Lee
>   > > >
>   > > >
>   > > >
>   >
>   >
>   >
>
