
[idn] Domain names and ASCII compatibility



hi

There has been a lot of talk about what a domain name is and what a
host name is, as well as about ASCII compatibility. Here are some
definitions and questions for you.

Domain name:
If you go back to the original DNS RFCs (1034, 1035) you will see
that a domain name is a sequence of labels under which
objects (records) can be defined. The domain name is the name
identifying something in the DNS. Each label in the current DNS
can be at most 63 octets long. There is no restriction on
8-bit values in a label.
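
Since the 63 limit counts octets and not characters, a non-ASCII label
uses up the limit faster once it is encoded as UTF-8. A small sketch of
how this could be checked (the function names are my own, and UTF-8 on
the wire is assumed here, it is not what the DNS does today):

```python
MAX_LABEL_OCTETS = 63  # per-label limit from RFC 1035


def label_octet_lengths(name: str) -> list[tuple[str, int]]:
    """Return (label, length in octets) for each label, assuming UTF-8."""
    return [(label, len(label.encode("utf-8"))) for label in name.split(".")]


def labels_fit(name: str) -> bool:
    """True if every label is within the 63-octet limit when UTF-8 encoded."""
    return all(n <= MAX_LABEL_OCTETS for _, n in label_octet_lengths(name))


# "gås" is 3 characters but 4 octets, because å is two octets in UTF-8.
print(label_octet_lengths("www.gås.net"))  # [('www', 3), ('gås', 4), ('net', 3)]
```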

A domain name can be used for many things; two you have often met
are host names and e-mail addresses. Examples are:
  Host name: orion.world.net
  E-mail:    kent\.xson.mail.net (same as kent.xson@mail.net).
Note that the e-mail address can contain a dot (.).
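
The e-mail form above follows the RFC 1035 convention where the local
part becomes the first label and any dot in it is escaped with a
backslash. A minimal sketch of that conversion (the function name is
hypothetical, and only dots are escaped, other special characters are
not handled):

```python
def mailbox_to_domain_name(address: str) -> str:
    """Write user@domain as a domain name, escaping dots in the local part."""
    # Split at the last "@" so the domain part is taken as-is.
    local, _, domain = address.rpartition("@")
    # A dot inside the local part must not be read as a label separator.
    escaped_local = local.replace(".", "\\.")
    return f"{escaped_local}.{domain}"


print(mailbox_to_domain_name("kent.xson@mail.net"))  # kent\.xson.mail.net
```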

So what is allowed in a domain name depends on its use.

We are defining internationalisation of domain names, not just
host names. So it should be possible to have names like:
   Host name: www.gås.net
   E-mail:    kåre.åkesson@gås.net
   

As the IETF now has an RFC stating that UTF-8 shall be supported in
protocols, UTF-8 is the natural candidate for encoding domain names
in the DNS protocol.

But what about compatibility with older systems that might
reject host names with non-ASCII in them, or e-mail systems
rejecting non-ASCII in e-mail addresses?
I guess we unfortunately have to support them in
some way.
This is where all the talk about quoted-printable mappings, CIDNUC,
UTF-5 and %-encodings in URLs comes in.
If old software just treated non-ASCII as "letters", like some
software has done, we could just use UTF-8. But it looks like
some old software does not work that way.
So what is the best way to have an ASCII compatible encoding?
As Kent has pointed out, quoted-printable in e-mail is a terrible thing.
Unfortunately, these ASCII compatibility solutions very often
end up being different for every type of name. Today we have:
  in E-mail: quoted-printable
  in URLs:   %-encoding
And for domain names we will have one more encoding!
Using UTF-8 everywhere will make it much more probable that we do
not see encoded text everywhere. It is also simpler to handle in
software. Just think: decoding e-mail - you have to handle decoding
of encoded domain names and e-mail addresses in headers and text,
quoted-printable MIME in headers, and quoted-printable, encoded
domain names and %-encoded URLs in the body. All with different encodings.
What a mess. Can we really expect the program to display the e-mail
for the user using the user's local character set? The user must not
be shown all the encoded text.
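
To make the mess concrete, here is a small sketch of the same UTF-8
text going through the two encodings we already have, using only the
Python standard library (quopri and urllib.parse). The variable names
are my own. Each encoding needs its own decoder before the user can be
shown the plain text:

```python
import quopri
from urllib.parse import quote, unquote

text = "gås"
utf8 = text.encode("utf-8")          # b'g\xc3\xa5s'

# E-mail style: quoted-printable escapes each non-ASCII octet as =XX.
qp = quopri.encodestring(utf8)       # b'g=C3=A5s'

# URL style: %-encoding escapes each non-ASCII octet as %XX.
pct = quote(text)                    # 'g%C3%A5s'

print(qp, pct)

# Two different decoders are needed to get back to the same text.
from_qp = quopri.decodestring(qp).decode("utf-8")
from_pct = unquote(pct)
print(from_qp == from_pct == text)   # True
```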

If we do an ASCII compatible encoding of domain names, we could do:
1) something like quoted-printable
2) something totally opaque.
(Note: I will here give examples for host/e-mail compatible names as
 they are probably most important).
 

Examples:
   Assume domain name: www.gås.net (gås has an a with ring above in the middle)
   
   1): www.g-c3a5s-.net (using the mapping I suggested in my draft)
   2): www.8wahdfhud.net
       or when encoding all labels: b5a2d2d2d.8wahdfhud.i3aodfdvd
       
   E-mail: kåre.åkesson@gås.net
   1): k-c3a5re.-c3a5kesson@g-c3a5s-.net
   2): fh9ldfhtdfdobfhldfdududpdod@8wahdfhud.net
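
The "c3a5" in the 1) examples is simply the two UTF-8 octets of å
written in hex. A rough sketch of a mapping in that style follows. It
is only an illustration and NOT the mapping from my draft: the rule
used here (a "-" followed by the hex octets of each run of non-ASCII
characters) is an assumption, it does not produce the trailing "-"
seen in g-c3a5s-, and it is not unambiguous to decode when a name
already contains "-" or when a hex digit follows the escaped octets.

```python
def hex_escape_label(label: str) -> str:
    """Illustrative only: escape runs of non-ASCII as '-' plus UTF-8 hex."""
    out = []
    in_escape = False
    for ch in label:
        if ord(ch) < 128:
            # ASCII characters are passed through unchanged and stay readable.
            out.append(ch)
            in_escape = False
        else:
            if not in_escape:
                out.append("-")  # assumed marker starting a hex run
                in_escape = True
            out.append(ch.encode("utf-8").hex())
    return "".join(out)


print(hex_escape_label("kåre"))     # k-c3a5re
print(hex_escape_label("åkesson"))  # -c3a5kesson
```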
       
What is best?
A quoted-printable-like mapping does allow ASCII characters to show
as ASCII characters, though for something like Japanese everything
will be encoded.
The opaque encoding encodes all characters, including the ASCII
characters, so it will be even more visible. Even ASCII-only names can
be encoded, so even ASCII-only users can see how it looks.
And how are we going to get an e-mail displayed correctly
(that is, without encoded text) for the user?

    Dan