
Re: OT - Re: [idn] URL encoding in html page



It is of course possible that some areas do not accept it, much as the
United States has not accepted the metric system (except for
scientific work, and the important realm of soft-drink bottles). It is
difficult to predict the speed of adoption of any technology, but I
suspect you will be surprised at the situation in 5 or 10 years.

Mark
—————

Γνῶθι σαυτόν — Θαλῆς
[For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]

http://www.macchiato.com

----- Original Message -----
From: "Soobok Lee" <lsb@postel.co.kr>
To: "Mark Davis" <mark@macchiato.com>; "IETF idn working group"
<idn@ops.ietf.org>
Sent: Friday, March 22, 2002 16:38
Subject: Re: OT - Re: [idn] URL encoding in html page


> Dear Mark,
>
> UNICODE will become more and more popular as time goes by.
> But that does not mean that legacy encodings will disappear or be
> obsoleted by UTF-8.
> There are at least two reasons why legacy encodings will be with us forever.
>
>  1. Most legacy codes are standardized by local *governments*, which
>     are best qualified to find and reflect their local communities'
>     character needs.
>     For example, the Korean government has been constantly revising its
>     KSX100? local legacy codes to include new graphic characters and
>     rarely used Chinese characters, even before UNICODE decided to
>     include them.
>     In other words, legacy codes are under the control of their
>     language communities. UNICODE is not; it has its own schedules,
>     principles, and motivations.
>     It may be *politically* impossible for legacy codes to be
>     obsoleted by UNICODE.
>     Can we imagine the Korean government publishing its laws and
>     regulations in UNICODE rather than in KSX100x?
>
>  2. Legacy encodings are already internationally interoperable in
>     popular HTML/MIME content.
>     There is no reason why owners of KSX100x-encoded homepages and
>     senders of KSX100x-encoded messages should abandon legacy encodings
>     and move to UTF-8 at the cost of additional space and operational
>     inefficiency, now or in the foreseeable future.
>
>  I believe UNICODE is everywhere now and will be everywhere in the
>  future. At the same time, UNICODE has given legacy encodings and
>  codes more opportunities to be interoperable at minimum cost.
>
>  Soobok Lee
>
>   ----- Original Message -----
>   From: Mark Davis
>   To: Soobok Lee ; IETF idn working group
>   Sent: Saturday, March 23, 2002 3:21 AM
>   Subject: OT - Re: [idn] URL encoding in html page
>
>
>   From my experience talking with customers in the field, the main
>   reason that people are not serving up UTF-8 pages is not the
>   bandwidth, it is the fact that there are still some browsers out in
>   the field that do not yet handle it correctly. While they are
>   dying off fairly quickly, it is not quite at the point where people
>   are willing to write them off.
>
>   As far as size goes, it is worthwhile looking at some data
>   samples. The following are from a page on the Unicode site that is
>   translated into different languages, so it has essentially the same
>   information on each page.
>      Size   Page
>      8882   s-chinese.html
>      8946   t-chinese.html
>      9347   esperanto.html
>      9498   maltese.html
>      9739   icelandic.html
>      9833   czech.html
>      9944   welsh.html
>     10064   danish.html
>     10109   swedish.html
>     10127   polish.html
>     10219   interlingua.html
>     10221   italian.html
>     10297   spanish.html
>     10308   portuguese.html
>     10312   lithuanian.html
>     10329   german.html
>     10376   romanian.html
>     10401   korean.html
>     10506   french.html
>     10726   japanese.html
>     10953   hebrew.html
>     11192   arabic.html
>     13292   greek.html
>     13870   russian.html
>     13892   persian.html
>     14549   hindi.html
>     15337   georgian.html
>     15853   deseret.html
>
>   So the best case is about 50% of the worst case. Some of this is
>   due to the encoding, and some is due to different languages just
>   using different numbers of characters. However, when you look at web
>   pages in general use, the amount of text (in bytes) is really
>   swamped by graphics, Javascript, HTML code, and so on. So
>   fundamentally, even the variations above are not that important in
>   practice.
>
>   BTW This is getting way off topic.
>
>   Mark
>   —————
>
>   Γνῶθι σαυτόν — Θαλῆς
>   [For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]
>
>   http://www.macchiato.com
>
>   ----- Original Message -----
>   From: "Soobok Lee" <lsb@postel.co.kr>
>   To: "Mark Davis" <mark@macchiato.com>; "IETF idn working group"
<idn@ops.ietf.org>
>   Sent: Friday, March 22, 2002 08:16
>   Subject: Re: [idn] URL encoding in html page
>
>
>   >
>   > ----- Original Message -----
>   > From: "Mark Davis" <mark@macchiato.com>
>   > To: "Soobok Lee" <lsb@postel.co.kr>; "IETF idn working group"
<idn@ops.ietf.org>
>   > Sent: Saturday, March 23, 2002 12:18 AM
>   > Subject: Re: [idn] URL encoding in html page
>   >
>   >
>   > > Compliant browsers already have to handle Unicode, since NCRs
>   > > (e.g. &#x1234; ) are always Unicode code points. All XML parsers
>   > > also have to handle Unicode (UTF-8 and UTF-16).
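
The point about NCRs can be checked with a minimal sketch. It assumes Python's standard `html` module as a stand-in for a browser's character-reference handling; the variable names are illustrative only.

```python
import html

# The numeric character reference used as the example in this thread.
ncr = "&#x1234;"

# html.unescape resolves the reference to the character at that
# Unicode code point, whatever encoding the page itself is stored in.
char = html.unescape(ncr)
code_point = ord(char)

# The same code point serialized in the two encodings XML parsers must accept.
utf8_bytes = char.encode("utf-8")
utf16_bytes = char.encode("utf-16-be")

print(f"{ncr} -> U+{code_point:04X}")
print(f"UTF-8: {utf8_bytes.hex()}  UTF-16BE: {utf16_bytes.hex()}")
```

The reference names a code point rather than a byte sequence, so it resolves the same way in a page served as euc-kr, Big5, or UTF-8.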
>   >
>   > Right, already.
>   > MS IE and Netscape have supported UNICODE for several years,
>   > but most homepages are still in legacy encodings.
>   > MS Word (already Unicode-based) has a feature for producing
>   > legacy-encoded .html files from Unicode-based .doc files for web
>   > publishing.
>   >
>   > Korean/Japanese/Chinese texts in UTF-8 are 50% bigger than legacy
>   > ones. 50% more disk space and bandwidth will be required.
>   > Each Cyrillic letter occupies one octet in a legacy code, while in
>   > UTF-8 it requires 2 octets. 100% more space is needed.
>   > I cannot imagine all Russians making the transition to UTF-8.
>   > Legacy encodings are more space efficient than UNICODE.
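
The size figures can be checked by encoding short samples both ways. This is a rough sketch, not a measurement of real pages: the sample strings are made up, and `euc_kr` and `koi8_r` are Python's codec names, assumed here as stand-ins for the Korean and Russian legacy encodings under discussion. Hangul syllables take two octets in EUC-KR and three in UTF-8; basic Cyrillic letters take one octet in KOI8-R and two in UTF-8.

```python
# Hypothetical sample strings; euc_kr and koi8_r are Python's codec names
# for two legacy encodings of the kind mentioned in this thread.
SAMPLES = {
    "Korean": ("한국어", "euc_kr"),
    "Russian": ("русский", "koi8_r"),
}


def encoded_sizes(text: str, legacy_codec: str) -> tuple[int, int]:
    """Return (legacy_bytes, utf8_bytes) for the same text."""
    return len(text.encode(legacy_codec)), len(text.encode("utf-8"))


for language, (text, codec) in SAMPLES.items():
    legacy_len, utf8_len = encoded_sizes(text, codec)
    growth = 100 * (utf8_len - legacy_len) / legacy_len
    print(f"{language}: {codec}={legacy_len} bytes, "
          f"UTF-8={utf8_len} bytes (+{growth:.0f}%)")
```

These ratios apply only to the native-script text itself. ASCII markup, spaces, and digits take one octet in either encoding, so the growth of a whole HTML page is smaller.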
>   >
>   > Legacy-to-legacy conversions like BIG5->KSX1001 are in practice
>   > implemented as two steps: BIG5->UNICODE and UNICODE->KSX1001.
>   > UNICODE is actively used as such an intermediate encoding, but it is
>   > still not used or entered directly by end users very much. Rather,
>   > UNICODE may be a hub that facilitates the interchange of information
>   > in different legacy encodings, or font sharing for characters in
>   > different legacy encodings.
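
The two-step conversion described here might look like the following sketch. It assumes Python's built-in `big5` and `euc_kr` codecs (EUC-KR being a common byte encoding of KS X 1001); `transcode` is a hypothetical helper name, not an existing library function.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert between two legacy encodings with Unicode as the hub."""
    text = data.decode(source)   # step 1: legacy -> Unicode
    return text.encode(target)   # step 2: Unicode -> legacy


# Two Chinese characters that exist in both character sets.
big5_bytes = "中文".encode("big5")
euc_kr_bytes = transcode(big5_bytes, "big5", "euc_kr")

# Both byte strings decode to the same Unicode text.
print(big5_bytes.decode("big5") == euc_kr_bytes.decode("euc_kr"))
```

With this design each legacy encoding needs only one pair of tables, to and from Unicode, rather than one table per pair of encodings. A character missing from the target set makes `encode` raise `UnicodeEncodeError`, so a real converter would need a policy for that case.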
>   >
>   > I regard UNICODE as a substrate (not as a competitor) upon which
>   > legacy encodings are built.
>   >
>   > >
>   > > > Legacy encodings
>   > > > will dominate even in the future, because they are compact and
>   > > > inexpensive.
>   > >
>   > > While I do expect the transition to Unicode to take some time,
>   > > once some of the older browsers die off it may shift more rapidly
>   > > than we think.
>   >
>   > I am not a UNICODE expert or a character expert. But every day I
>   > feel the strong inertia toward legacy encodings in our local
>   > language communities.
>   > Language-tagging-enabled text formats like HTML will greatly
>   > lengthen the lifespan of legacy encodings and allow legacy-encoded
>   > HTML texts to be interchanged internationally without problems.
>   >
>   > Soobok Lee
>   >
>   > >
>   > > Mark
>   > > —————
>   > >
>   > > Γνῶθι σαυτόν — Θαλῆς
>   > > [For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]
>   > >
>   > > http://www.macchiato.com
>   > >
>   > > ----- Original Message -----
>   > > From: "Soobok Lee" <lsb@postel.co.kr>
>   > > To: "IETF idn working group" <idn@ops.ietf.org>
>   > > Sent: Friday, March 22, 2002 02:04
>   > > Subject: Re: [idn] URL encoding in html page
>   > >
>   > >
>   > > >
>   > > > ----- Original Message -----
>   > > > From: "Bruce Thomson" <bthomson@fm-net.ne.jp>
>   > > > To: "Soobok Lee" <lsb@postel.co.kr>; "IETF idn working
group"
>   > > <idn@ops.ietf.org>
>   > > > Sent: Friday, March 22, 2002 6:29 PM
>   > > > Subject: Re: [idn] URL encoding in html page
>   > > >
>   > > >
>   > > > > > What if all the viewable HTML text is in English, but only
>   > > > > > the href URL contains legacy (Korean) encoded hostnames?
>   > > > > > Chinese visitors would see a clean English homepage, but
>   > > > > > fail to click through the Korean link.
>   > > > > >
>   > > > > Well, that could happen, but a META tag would solve that so
>   > > > > easily. Personally I often use a simple text editor to deal
>   > > > > with HTML, and would find it easier to use legacy encodings
>   > > > > or UTF-8 than cut-and-paste ACE from somewhere.
>   > > > > Of course the user could do it either way and it would work.
>   > > >
>   > > > Yes. Charset META tags help. But many homepages make assumptions
>   > > > about their main audience's default character encoding and very
>   > > > often omit the META tag for the encoding, like:
>   > > >   <meta http-equiv="Content-Type" content="text/html; charset=euc-kr">
>   > > >
>   > > > Moreover, IDN URLs would be used in pure FRAMESET documents that
>   > > > define frame URLs and contain no viewable text. Such FRAMESET
>   > > > documents often omit charset META tags.
>   > > >  (look into the HTML source of http://www.freeway.co.kr/ )
>   > > >
>   > > > AFAIK, 99.99999% of Korean homepages use an implicit or explicit
>   > > > legacy Korean encoding (KS_C_5601-1987 or euc-kr). So do most
>   > > > Japanese and Chinese homepages.
>   > > > UTF-8/UCS-2 encodings are rarely used in global web publishing.
>   > > > Legacy encodings will dominate even in the future, because they
>   > > > are compact and inexpensive.
>   > > >
>   > > > If we want to make IDN truly internationally interoperable, all
>   > > > IDN-aware web browsers/applications should contain libraries of
>   > > > all kinds of legacy-to-Unicode conversion routines. That will put
>   > > > too much memory load on handheld devices like PDAs.
>   > > >
>   > > > Moreover, legacy encodings are revised separately from Unicode.
>   > > > We may face versioning problems as tough as the
>   > > > stringprep/nameprep versioning problems for newly added Unicode
>   > > > code points.
>   > > > How can we guarantee the stability and integrity of IDN operations
>   > > > across all combinations of the numerous kinds and versions of
>   > > > IDN-aware applications and legacy encodings?
>   > > >
>   > > > Soobok Lee
>   > > >
>   > > >
>   > > >
>   >
>   >
>   >
>