
Re: [idn] URL encoding in html page



On Fri, 29 Mar 2002, James Seng wrote:

>> So you mean there will be more characters that we will use than IP
>> addresses in the world that everyone knows is running out with IPv4...
>
>The lesson of IPv4/6 is that once upon a time, someone claimed 32 bits
>is all we need for IP.

That's a bit different -- nobody is supposed to remember all of the IP
addresses out there. Someone *is*, when it comes to characters. Besides,
not that many characters are out there, considering that the largest
group, Chinese ideographs, has largely been encoded in Unicode already.
(True, there are esoteric ones. But after Unicode 3.1 with its CJK Unified
Ideographs Extension B, that group is rapidly shrinking, too.)

>There are 500+ Han ideographs used in Singapore alone that are not
>in ISO 10646, of which only 25 are in IRG review. Most of them cannot
>even be justified under the normal process because they are characters
>"created" for some person's name.

Then we might argue that such characters are analogous to corporate logos,
and *shouldn't* in fact be encoded. Not before the glyphs are in
widespread use and some sort of semantic distinction has arisen between
them and their progenitors.

>Fortune Teller: "Your fire element is too strong and your water element
>is lacking. Your bad luck comes from the unbalanced Yi-Ching. Let's add
>some water, 3 dots (water kangxi), to this character of your name".
>Bingo! A new character is born.

No. A new *glyph* for the same *character* is born. If the meaning does
not change, I would view this as a glyph variant. From there on, it's the
person's responsibility to lobby for the inclusion of this glyph in
fonts as an alternative representation. No need for standards
organizations to get involved...

>Some software supports UTF-8. No software supports Nameprep-UTF-8.
>Software that deals with IDN using UTF-8 still has to upgrade. But let's
>not jump ahead. I'll wait for your draft.

I too think going directly to UTF-8 would be more elegant than using ACE.
However, it is not a panacea, either -- AFAICS, all of the problems with
ACE and ideographs come from the way these characters are used, not from
ACE itself. So this trouble won't be going *anywhere* if we jump to UTF-8.
The only thing one would be buying with such a move would be elegance and,
perhaps, unlimited host name lengths.
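
To make the length point concrete, here is a rough Python sketch that
compares how many octets a single label takes as raw UTF-8 and in an
ACE form. Assumptions, loudly: the thread does not name a specific ACE,
so Punycode with an "xn--" prefix stands in for one here; nameprep is
skipped entirely; and the helper names (utf8_size, ace_form, report) are
made up for illustration. The 63-octet figure is the existing DNS limit
on one label.

```python
# Sketch only: Punycode + "xn--" is an ASSUMED stand-in for "ACE".
# No nameprep (normalization, case folding) is applied to the input.

MAX_LABEL_OCTETS = 63  # existing DNS limit on a single label


def utf8_size(label: str) -> int:
    """Number of octets the label occupies when sent as raw UTF-8."""
    return len(label.encode("utf-8"))


def ace_form(label: str) -> str:
    """ASCII-compatible form of the label (hypothetical helper)."""
    if label.isascii():
        return label  # plain ASCII labels need no ACE
    return "xn--" + label.encode("punycode").decode("ascii")


def report(label: str) -> dict:
    """Collect both sizes and whether each fits in one DNS label."""
    ace = ace_form(label)
    return {
        "utf8_octets": utf8_size(label),
        "ace": ace,
        "ace_octets": len(ace),
        "utf8_fits": utf8_size(label) <= MAX_LABEL_OCTETS,
        "ace_fits": len(ace) <= MAX_LABEL_OCTETS,
    }


if __name__ == "__main__":
    for label in ("example", "b\u00fccher", "\u4e2d\u6587"):
        print(label, report(label))
```

Either way the label carries the same code points, so whatever ambiguity
comes from how ideographs are used survives the change of encoding; only
the octet counts differ.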

Sampo Syreeni, aka decoy - mailto:decoy@iki.fi, tel:+358-50-5756111
student/math+cs/helsinki university, http://www.iki.fi/~decoy/front
openpgp: 050985C2/025E D175 ABE5 027C 9494 EEB0 E090 8BA9 0509 85C2