Re: OT - Re: [idn] URL encoding in html page

Dear Mark,

UNICODE will become more and more popular as time goes by.

But that does not mean that legacy encodings will disappear or be made obsolete by UTF8.

There are at least two reasons why legacy encodings will be with us forever.

1. Most legacy codes are standardized by local *governments*, which are best qualified to

find and reflect their local communities' character needs.

For example, the Korean government has been constantly revising its KSX100? local legacy codes

to include new graphic characters and new rarely used Chinese letters, even before

UNICODE decided to include them.

In other words, legacy codes are under the control of their language communities. But UNICODE

is not; it has its own schedules, principles, and motivations.

It may be *politically* impossible for legacy codes to be made obsolete by UNICODE.

Can we imagine the Korean government publishing its laws and regulations in UNICODE rather than in KSX100x?

2. Legacy encodings are already internationally interoperable in popular HTML/MIME content.

There is no reason why owners of KSX100x-encoded homepages and senders of such messages

should abandon legacy encodings and move to UTF8 at the cost of additional

space and operational inefficiency, now or in the foreseeable future.

I believe UNICODE is now everywhere and will remain everywhere in the future. At the same time,

UNICODE has given legacy encodings/codes more opportunities to be interoperable at

minimum cost.
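
As a rough illustration of the hub role mentioned in the quoted message below (BIG5->UNICODE and UNICODE->KSX1001), here is a minimal Python sketch. The function name `convert_via_unicode` is made up for this example, and Python's built-in `big5` and `euc_kr` codecs are assumed to stand in for the conversion tables a real product would ship; it is not meant to describe any particular implementation.

```python
def convert_via_unicode(data: bytes, src_codec: str, dst_codec: str) -> bytes:
    """Convert between two legacy encodings, using Unicode as the pivot."""
    text = data.decode(src_codec)   # step 1: legacy -> Unicode
    return text.encode(dst_codec)   # step 2: Unicode -> legacy


if __name__ == "__main__":
    # Two Chinese characters that exist in both Big5 and KS X 1001.
    big5_bytes = "中文".encode("big5")
    euckr_bytes = convert_via_unicode(big5_bytes, "big5", "euc_kr")
    print(big5_bytes.hex(), "->", euckr_bytes.hex())
```

A character that has no counterpart in the destination code set makes the second step raise `UnicodeEncodeError`, which is where revisions of the legacy code sets start to matter.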

Soobok Lee

----- Original Message -----

From: Mark Davis

To: Soobok Lee ; IETF idn working group

Sent: Saturday, March 23, 2002 3:21 AM

Subject: OT - Re: [idn] URL encoding in html page

From my experience talking with customers in the field, the main reason that people are not serving up UTF-8 pages is not the bandwidth, it is the fact that there are still some browsers out in the field that do not yet handle it correctly. While they are dying off fairly quickly, it is not quite at the point where people are willing to write them off.

As far as size goes, it is worthwhile looking at some data samples. The following are from a page on the Unicode site that is translated into different languages, so it has essentially the same information on each page.

Size   Page
8882   s-chinese.html
8946   t-chinese.html
9347   esperanto.html
9498   maltese.html
9739   icelandic.html
9833   czech.html
9944   welsh.html
10064  danish.html
10109  swedish.html
10127  polish.html
10219  interlingua.html
10221  italian.html
10297  spanish.html
10308  portuguese.html
10312  lithuanian.html
10329  german.html
10376  romanian.html
10401  korean.html
10506  french.html
10726  japanese.html
10953  hebrew.html
11192  arabic.html
13292  greek.html
13870  russian.html
13892  persian.html
14549  hindi.html
15337  georgian.html
15853  deseret.html

So the best case is about 56% of the size of the worst case (8882 vs. 15853 bytes). Some of this is due to the encoding, and some is due to different languages just using different numbers of characters. However, when you look at web pages in general use, the amount of text (in bytes) is really swamped by graphics, Javascript, HTML code, and so on. So fundamentally, even the variations above are not that important in practice.
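
To separate the part of the size difference that comes from the encoding alone, one can encode the same string both ways and count bytes. The following is a minimal Python sketch; the sample words and the choice of `euc_kr` and `koi8_r` as the legacy codecs are assumptions made for illustration and are not taken from the pages measured above.

```python
def encoded_sizes(text: str, legacy_codec: str) -> tuple[int, int]:
    """Return (bytes in the legacy encoding, bytes in UTF-8) for the same text."""
    return len(text.encode(legacy_codec)), len(text.encode("utf-8"))


# Hypothetical sample words, each paired with a common legacy codec.
SAMPLES = {
    "Korean": ("한국어", "euc_kr"),
    "Russian": ("Русский", "koi8_r"),
    "English": ("English", "ascii"),
}

if __name__ == "__main__":
    for name, (text, codec) in SAMPLES.items():
        legacy, utf8 = encoded_sizes(text, codec)
        print(f"{name}: {legacy} bytes in {codec}, {utf8} bytes in UTF-8")
```

Hangul syllables take 2 bytes in EUC-KR and 3 in UTF-8, Cyrillic letters take 1 byte in KOI8-R and 2 in UTF-8, and ASCII text is the same size either way. Markup, scripts, and URLs in a real page are mostly ASCII, which dilutes the difference.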

BTW This is getting way off topic.

Mark

—————

Γνῶθι σαυτόν — Θαλῆς
[For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]

http://www.macchiato.com

----- Original Message -----
From: "Soobok Lee" <lsb@postel.co.kr>

To: "Mark Davis" <mark@macchiato.com>; "IETF idn working group" <idn@ops.ietf.org>

Sent: Friday, March 22, 2002 08:16

Subject: Re: [idn] URL encoding in html page

>
> ----- Original Message -----
> From: "Mark Davis" <mark@macchiato.com>
> To: "Soobok Lee" <lsb@postel.co.kr>; "IETF idn working group" <idn@ops.ietf.org>
> Sent: Saturday, March 23, 2002 12:18 AM
> Subject: Re: [idn] URL encoding in html page
>
>
> > Compliant browsers already have to handle Unicode, since NCRs (e.g.
> > &#x1234; ) are always Unicode code points. All XML parsers also have
> > to handle Unicode (UTF-8 and UTF-16).
>
> Right, already.
> MS IE and Netscape have already supported UNICODE
> for several years, but most homepages are still in legacy encodings.
> MS WORD (already unicode based) has features to produce legacy-encoded
> .html files (from unicode-based .doc files) for web publishing.
>
> Korean/Japanese/Chinese texts in UTF8 are 50% bigger than legacy ones.
> 50% more disk space and bandwidth will be required.
> Each Cyrillic letter in a legacy code occupies one octet, while in UTF8
> it requires 2 octets. 100% more space is needed.
> I cannot imagine all Russians making the transition to UTF8.
> Legacy encodings are more space efficient than UNICODE.
>
> Legacy-to-legacy conversions like BIG5->KSX1001 are actually implemented
> as two steps, BIG5->UNICODE and UNICODE->KSX1001. UNICODE
> is actively used as such an intermediate encoding, but it is still not used or entered
> directly by end users as actively. Rather, UNICODE may be a hub that facilitates the interchange
> of information between different legacy encodings, or font sharing for differently legacy-encoded chars.
>
> I regard UNICODE as a substrate (not as a competitor) upon which legacy encodings are built.
>
> >
> > > Legacy encodings
> > > will dominate even in the future, because they are compact and
> > > inexpensive.
> >
> > While I do expect the transition to Unicode to take some time, once
> > some of the older browsers die off it may shift more rapidly than we
> > think.
>
> I am neither a UNICODE expert nor a character expert. But every day I feel
> the strong inertia toward legacy encodings in our local language communities.
> Language-tagging-enabled text formats like HTML will lengthen the lifespan
> of legacy encodings considerably and allow legacy-coded HTML texts
> to be interchanged internationally without problems.
>
> Soobok Lee
>
> >
> > Mark
> > —————
> >
> > Γνῶθι σαυτόν — Θαλῆς
> > [For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]
> >
> > http://www.macchiato.com
> >
> > ----- Original Message -----
> > From: "Soobok Lee" <lsb@postel.co.kr>
> > To: "IETF idn working group" <idn@ops.ietf.org>
> > Sent: Friday, March 22, 2002 02:04
> > Subject: Re: [idn] URL encoding in html page
> >
> >
> > >
> > > ----- Original Message -----
> > > From: "Bruce Thomson" <bthomson@fm-net.ne.jp>
> > > To: "Soobok Lee" <lsb@postel.co.kr>; "IETF idn working group" <idn@ops.ietf.org>
> > > Sent: Friday, March 22, 2002 6:29 PM
> > > Subject: Re: [idn] URL encoding in html page
> > >
> > >
> > > > > What if all the html viewable text is in english, but, only the href url contains
> > > > > legacy (korean) encoded hostnames? chinese visitors would see clean english homepage,
> > > > > but fail to click through the korean link.
> > > > >
> > > > Well, that could happen, but a META tag would solve that so easily. Personally
> > > > I often use a simple text editor to deal with HTML, and would find it easier to
> > > > use legacy encodings or UTF-8 than cut-and-paste ACE from somewhere.
> > > > Of course the user could do it either way and it would work.
> > >
> > > Yes. Charset META tags help. But many homepages make assumptions about the main audience's
> > > default char encoding and very often omit the META tag for the encoding, like:
> > > <meta http-equiv="Content-Type" content="text/html; charset=euc-kr">
> > >
> > > Moreover, an IDN url could be used in a pure FRAMESET document that defines frame URLs
> > > and contains no viewable text. Such FRAMESET documents often omit charset META tags.
> > > (look into the html source of http://www.freeway.co.kr/ )
> > >
> > > AFAIK, 99.99999% of korean homepages have implicit/explicit
> > > legacy korean encoding (KS_C_5601-1987 or euc-kr). So do most japanese/chinese homepages.
> > > UTF8/UCS-2 encodings are rarely used in global WEB publishing. Legacy encodings
> > > will dominate even in the future, because they are compact and inexpensive.
> > >
> > > If we want to make IDN truly internationally interoperable, all IDN-aware web browsers/applications
> > > should contain libraries of all kinds of legacy-to-Unicode conversion routines. That would impose
> > > too great a memory burden on handheld devices like PDAs.
> > >
> > > Moreover, legacy encodings are revised separately from unicode. We may face versioning problems
> > > as tough as those we faced in stringprep/nameprep versioning for newly added unicode code points.
> > > How can we guarantee the stability and integrity of IDN operations across all combinations of
> > > the numerous kinds and versions of IDN-aware applications and legacy encodings?
> > >
> > > Soobok Lee
> > >
> > >
> > >
>
>
>