
Re: An argument against multiple character sets



Hello;

I'm assuming you are using "character set" interchangeably with "encoding"
below... 

At 12:01 PM 1/23/00 -0800, Paul Hoffman / IMC wrote:
>There has been some discussion on this list about whether or not we should 
>allow domain names to be created in different character sets. I believe 
>that there is a simple argument that shows that we can't.
>
>Let's say I want to register a domain name that is two letters: LATIN SMALL 
>LETTER F followed by LATIN SMALL LETTER U WITH OGONEK. If I use ISO 8859-4, 
>that would be encoded as 0x66F9. So far so good. You see a billboard with my 
>domain name on it, and you enter it into a browser. That browser uses a 
>different character set, let's say Unicode. The browser sends to the 
>resolver 0x00660173.
>
>There are two problems here:
>- The browser *can't* know every possible character set
>- Even if it did, it wouldn't know which one to use

Exactly! <smile>
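
A minimal Python sketch of that mismatch, for illustration only. It assumes
the "Unicode" form in the example means 16-bit big-endian code units, and it
uses Python's built-in codecs rather than anything a browser or resolver
actually runs. It shows the same two letters producing different bytes per
encoding, and the same bytes producing different letters per character set.

```python
# The two-letter name from the quoted example:
# LATIN SMALL LETTER F + LATIN SMALL LETTER U WITH OGONEK (U+0173).
name = "f\u0173"

# Same characters, three different byte sequences.
# "utf_16_be" stands in for "Unicode" here (an assumption, see above).
wire_forms = {enc: name.encode(enc)
              for enc in ("iso8859_4", "utf_16_be", "utf_8")}
for enc, raw in wire_forms.items():
    print(f"{enc:10} {raw.hex()}")

# The reverse problem: take the ISO 8859-4 bytes and read them under
# other single-byte character sets. The second letter changes each time.
raw = wire_forms["iso8859_4"]
readings = {enc: raw.decode(enc)
            for enc in ("iso8859_1", "iso8859_2", "iso8859_4")}
for enc, text in readings.items():
    # ascii() escapes non-ASCII so the output is safe on any terminal.
    print(f"{enc:10} {ascii(text)}")
```

Nothing in the bytes themselves says which reading is the intended one,
which is the second problem in the list above.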

That's why Microsoft has adopted UTF-8 (an encoding of Unicode) as its
"standard" default configuration, both in IE5 and in Windows 2000 DNS. And
once Netscape adopts the same default "standard", both browsers will only
(or, primarily) send UTF-8 queries to the resolver. Our customers tell us
other browsers (such as Opera) can also resolve our UTF-8 test URLs.
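
One property of UTF-8 that matters here can be checked in a few lines of
Python. This is only a sketch: the non-ASCII label is a made-up example, and
it says nothing about how any particular browser or DNS server builds its
queries. It shows that an all-ASCII name is byte-for-byte unchanged in UTF-8,
while every byte of a non-ASCII character has its high bit set.

```python
# An existing all-ASCII name is identical in ASCII and UTF-8.
ascii_name = "whats.nu"
same_bytes = ascii_name.encode("utf_8") == ascii_name.encode("ascii")
print(same_bytes)

# Hypothetical internationalized label: "f" + U WITH OGONEK (U+0173).
label = "f\u0173"
utf8 = label.encode("utf_8")
print(utf8.hex())

# Bytes below 0x80 are plain ASCII; bytes at or above 0x80 belong to a
# multi-byte character, so the two can never be confused.
high_bit = [b >= 0x80 for b in utf8]
print(high_bit)
```

So a server that only ever stored ASCII names sees no change for those names.
The open question is what it does with the bytes that have the high bit set.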

How best to modify BIND so that it can deal with all this is probably much
more important to this discussion than deciding which encoding should be set
as the standard, IMO. UTF-8 already looks pretty "standard", at least from
the client/user point of view.

I'm not saying I think this "unofficial" working group should just bless
UTF-8. I'm saying the more important work is in developing standards for
upgrading BIND.


-- Bill Semich
.NU Domain

>
>Adding a charset tag to the internationalized string in the domain name 
>doesn't help. There is no way for someone seeing a printed representation 
>of the internationalized string to know which character set was used; in 
>this case it could be 8859-4 or Unicode or possibly other character sets 
>that contain that character.
>
>Even requiring all resolvers to do the conversion doesn't help unless we 
>list all the possible character sets and never change the list. This 
>introduces many problems:
>- New character sets can't be added later without simultaneously updating 
>all the resolvers on the Internet to use the added character sets. Such 
>simultaneous updates are impossible.
>- The main reason we are considering more than one character set now is 
>current politics and desires for favored character sets. We can safely 
>assume that politics and desires will continue to change and evolve.
>- We are forcing resolvers to do much more processing than they are now.
>
>In short, I don't see how a solution that allows more than one character 
>set, or even more than one encoding, will work. If others have 
>counter-examples, I'm open to hearing them.
>
>--Paul Hoffman, Director
>--Internet Mail Consortium
>
>
>
Bill Semich
President and Founder
.NU Domain Ltd
http://whats.nu
bill@mail.nic.nu