Re: [idn] URL encoding in html page




----- Original Message -----
From: "Mark Davis" <mark@macchiato.com>
To: "Soobok Lee" <lsb@postel.co.kr>; "IETF idn working group" <idn@ops.ietf.org>
Sent: Saturday, March 23, 2002 12:18 AM
Subject: Re: [idn] URL encoding in html page


> Compliant browsers already have to handle Unicode, since NCRs (e.g.
> &#x1234; ) are always Unicode code points. All XML parsers also have
> to handle Unicode (UTF-8 and UTF-16).

Right, already.
MS IE and Netscape have supported Unicode
for several years, but most homepages are still in legacy encodings.
MS Word (already Unicode-based) has a feature to produce legacy-encoded
.html files from Unicode-based .doc files for web publishing.

Korean/Japanese/Chinese text in UTF-8 is 50% bigger than in legacy encodings
(3 octets per character instead of 2), so 50% more disk space and bandwidth
will be required. Each Cyrillic letter occupies one octet in a legacy code,
while in UTF-8 it requires 2 octets, so 100% more space is needed.
I cannot imagine all Russians making the transition to UTF-8.
Legacy encodings are more space-efficient than UTF-8 for these scripts.
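
The octet counts can be checked with a short Python sketch. The sample
strings and the choice of codecs (euc-kr, shift_jis, koi8-r) are assumptions
picked for illustration; they are not taken from any real page.

```python
# Illustrative sample strings (assumptions, not real page content).
samples = {
    "Korean / EUC-KR": ("한국어", "euc-kr"),
    "Japanese / Shift_JIS": ("日本語", "shift_jis"),
    "Russian / KOI8-R": ("Привет", "koi8-r"),
}


def size_pair(text, legacy_codec):
    """Return (octets in the legacy encoding, octets in UTF-8) for text."""
    return len(text.encode(legacy_codec)), len(text.encode("utf-8"))


for label, (text, codec) in samples.items():
    legacy, utf8 = size_pair(text, codec)
    growth = 100 * (utf8 - legacy) // legacy
    print(f"{label}: {legacy} octets legacy, {utf8} octets UTF-8 (+{growth}%)")
```

This counts only the characters of the sample strings themselves.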

Legacy-to-legacy conversions like BIG5->KSX1001 are in practice implemented
as two steps: BIG5->Unicode and Unicode->KSX1001. Unicode
is actively used as such an intermediate encoding, but it is still not used or entered
directly by end users as actively. Rather, Unicode may be a hub that facilitates interchange
of information between different legacy encodings, or font sharing for characters in different legacy encodings.
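
A minimal Python sketch of the two-step conversion described above. The
helper name convert_via_unicode and the sample string are hypothetical, and
Python's "euc-kr" codec is assumed here as a stand-in for KSX1001.

```python
def convert_via_unicode(data: bytes, src: str, dst: str) -> bytes:
    """Convert between two legacy encodings using Unicode as the pivot."""
    text = data.decode(src)   # step 1: legacy -> Unicode
    return text.encode(dst)   # step 2: Unicode -> legacy


# Hypothetical sample: hanja that exist in both character sets.
big5_bytes = "大韓民國".encode("big5")
euc_kr_bytes = convert_via_unicode(big5_bytes, "big5", "euc-kr")
print(euc_kr_bytes.decode("euc-kr"))
```

A character that exists in the source set but not in the target set makes
step 2 raise UnicodeEncodeError, so a real converter would need a policy for
unmappable characters.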

I regard Unicode as a substrate (not as a competitor) upon which legacy encodings are built.

>
> > Legacy encodings
> > will dominate even in the future, because they are compact and
> > inexpensive.
>
> While I do expect the transition to Unicode to take some time, once
> some of the older browsers die off it may shift more rapidly than we
> think.

I am neither a Unicode expert nor a character-set expert. But every day I feel
the strong inertia toward legacy encodings in our local language communities.
Text formats with language tagging, like HTML, will greatly lengthen the lifespan
of legacy encodings and allow legacy-encoded HTML text
to be interchanged internationally without problems.

Soobok Lee

>
> Mark
> —————
>
> Γνῶθι σαυτόν — Θαλῆς
> [For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]
>
> http://www.macchiato.com
>
> ----- Original Message -----
> From: "Soobok Lee" <lsb@postel.co.kr>
> To: "IETF idn working group" <idn@ops.ietf.org>
> Sent: Friday, March 22, 2002 02:04
> Subject: Re: [idn] URL encoding in html page
>
>
> >
> > ----- Original Message -----
> > From: "Bruce Thomson" <bthomson@fm-net.ne.jp>
> > To: "Soobok Lee" <lsb@postel.co.kr>; "IETF idn working group"
> <idn@ops.ietf.org>
> > Sent: Friday, March 22, 2002 6:29 PM
> > Subject: Re: [idn] URL encoding in html page
> >
> >
> > > > What if all the html viewable text is in english, but, only the href
> > > > url contains legacy (korean) encoded hostnames?  chinese visitors would
> > > > see clean english homepage, but fail to click through the korean link.
> > > >
> > > Well, that could happen, but a META tag would solve that so easily.
> > > Personally I often use a simple text editor to deal with HTML, and would
> > > find it easier to use legacy encodings or UTF-8 than cut-and-paste ACE
> > > from somewhere.
> > > Of course the user could do it either way and it would work.
> >
> > Yes. Charset META tags help. But many homepages make assumptions about the
> > main audience's default character encoding and very often omit the META
> > tag for the encoding, like:
> >   <meta http-equiv="Content-Type" content="text/html; charset=euc-kr">
> >
> > Moreover, an IDN url would be used in a pure FRAMESET document that defines
> > frame URLs and contains no viewable text. Such FRAMESET documents often
> > omit charset META tags.
> >  (look into the html source of http://www.freeway.co.kr/ )
> >
> > AFAIK, 99.99999% of Korean homepages have an implicit/explicit
> > legacy Korean encoding (KS_C_5601-1987 or euc-kr). So do most
> > Japanese/Chinese homepages.
> > UTF8/UCS-2 encodings are rarely used in global web publishing. Legacy
> > encodings will dominate even in the future, because they are compact and
> > inexpensive.
> >
> > If we want to make IDN truly internationally interoperable, all IDN-aware
> > web browsers/applications should contain libraries of all kinds of
> > legacy-to-Unicode conversion routines. That would put too much memory load
> > on handheld devices like PDAs.
> >
> > Moreover, legacy encodings are revised separately from Unicode. We may
> > face versioning problems as tough as the stringprep/nameprep versioning
> > problems for newly added Unicode code points.
> > How can we guarantee the stability and integrity of IDN operations across
> > all combinations of the numerous kinds and versions of IDN-aware
> > applications and legacy encodings?
> >
> > Soobok Lee
> >
> >
> >