
Re: [idn] Prohibit CDN code points



There are, of course, problems in Unicode; unavoidable with a project
that complex. Some were introduced for compatibility with the host of
legacy code pages in the world (by our count in ICU, well over 700
unique code pages in current use); others could have been avoided had
we known what we know now.

Unification is not one of them, however. It is not so much a feature of
Unicode as a feature of human writing. We could have chosen to deunify
every language's characters, even every dialect's: a Swedish 'a' would
be different from a French 'a', different from an English 'a', even
different from a Yorkshire 'a' or a New York 'a'. After all, that would
allow one to detect different languages, and to sort or match
differently on that basis. For that matter, we could have chosen to
deunify fonts and styles: bold 'u' from italic, Helvetica from Times
New Roman.
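A small Python sketch (mine, not part of the original mail) may make
the distinction concrete: unified characters share one code point
regardless of language, while the relatively few confusables that do
exist come from genuinely distinct scripts.

```python
import unicodedata

# Unification: the 'a' in Swedish, French, and English text is one and
# the same code point, U+0061, no matter which language is intended.
assert ord("a") == 0x0061

# The confusables that remain come from distinct scripts, e.g. the
# Cyrillic letter U+0430, which merely looks like Latin 'a' (U+0061).
latin_a = "a"
cyrillic_a = "\u0430"
assert unicodedata.name(latin_a) == "LATIN SMALL LETTER A"
assert unicodedata.name(cyrillic_a) == "CYRILLIC SMALL LETTER A"
assert latin_a != cyrillic_a  # distinct code points, identical glyphs
```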

But what a huge mess it would be. Rather than a relatively small
number of confusable characters, we would have essentially all
characters confusable. Applications would have to deal with a
tremendous increase in the number of characters, drastically
increasing the memory needed to store the necessary character
properties, and users would face an incredible number of problems from
the visual confusion of so many characters. Much too high a price, on
balance, compared with dealing with matching issues in a simpler
representation.

The TC/SC problem can be dealt with by registering a small number of
additional names, little different in kind from registering both
theatre.com and theater.com, or aarborg.com and a<ring>rborg.com.
While in theory someone could have a 5-character name with 32 TC/SC
combinations, in practice nobody has shown this to be a real problem,
or even provided any evidence whatsoever that clients will in fact be
confused.
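The 32-combination figure is simply 2^5: each of the five characters
can independently appear in its Traditional (TC) or Simplified (SC)
form. A quick illustrative sketch (not from the original mail):

```python
from itertools import product

# Each of the 5 characters in the name has two possible forms.
name_length = 5
forms = ("TC", "SC")

# Independent choices per position give 2**5 = 32 possible spellings,
# each of which a registrant could register as a separate name.
spellings = list(product(forms, repeat=name_length))
assert len(spellings) == 2 ** name_length == 32
```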

Mark
—————

πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο πάντα — Ὁμήρου Μαργίτῃ
["He knew many works, but he knew them all badly" — Homer, Margites]
[For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]

http://www.macchiato.com

----- Original Message -----
From: "Patrik Fältström" <paf@cisco.com>
To: "YangWoo Ko" <newcat@spsoft.co.kr>; "IETF-IDN" <idn@ops.ietf.org>
Sent: Wednesday, January 23, 2002 05:18
Subject: Re: [idn] Prohibit CDN code points


> --On 2002-01-23 21.47 +0900 YangWoo Ko <newcat@spsoft.co.kr> wrote:
>
> > Your last statement does not exactly describe the TC/SC issue. The
> > following may explain it better:
> >
> > "If one enters a string in Unicode, one may or may not know whether
> > TC or SC was used. It depends both on the language one has in mind
> > when entering the string and on one's knowledge of the characters."
>
> Correct.
>
> > Dear all members,
> >
> > What about having additional prefix(es) for extensions like the
> > TC/SC issue? For example, az-- for normal IDNA and bz-- for
> > Chinese-extension IDNA, and so forth. It may serve as context
> > information or a language tag.
>
> How do you match a string that uses az--<foo>.com against one that
> uses bz--<foo>.com, where "<foo>" stands for the term "foo" in
> encoded form?
>
> And, yes, you can do this, but as I have pointed out before, it
> means every server needs to know about all matching algorithms.
>
> I.e., if you open the box of "problems" with Unicode, you will find
> that the SC/TC problem is only one of them. Only one. I guess we have
> some 20-30 other problems which are similar to SC/TC, i.e. problems
> because of unification or non-unification in Unicode.
>
> So, you will see an explosion of matching rules.
>
> The reason we see this about SC/TC is that it happens to be the
> problem space we are discussing at the moment. We could as well
> discuss the problems with a-diaeresis in the countries and languages
> which use it.
>
> My conclusion is the same: every server needs to have knowledge
> about how to handle all encodings.
>
>    paf