[idn] Normalisation and ASCII fallbacks
>I suggest ignoring UTR 21. Just downcase according to the 'default'
>(non-normative) in the main property table for Unicode 3.0.
>Correction: Normalisation has to be done (in principle) AFTER a case change
>of any form,
>otherwise the result might not be in any one of the defined normal forms.
>(I can supply
>a detailed argument if you like.)
I used UTR 21 because I thought it was more correct, but from what you
say maybe it is best to go the simplest way for many programmers. It
might also be best with simple rules.
To do case folding to lower case, just use the main property table
for Unicode 3.0. This has one to one mapping only which is very nice as it
simplifies data handling.
In DNS, international text will be used both in domain names and in
text data in records (such as the TXT or HINFO records).
The simplest way would be to say: use normalisation form C everywhere.
This way everything is the same in all places and you do not have to worry
about different forms in different places.
The next matter is doing a lookup in DNS. Here it would be nice to match
names the way a human would want, ignoring cosmetic differences.
To just say: case-fold the form C data to lower case and then
compare, would be the simplest. But it might make a lot of names
that many humans regard as the same compare as different.
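To illustrate the concern, here is a minimal sketch in Python of the "form C, then lower case, then compare" rule. The function names are made up for this example, and Python's str.lower() follows the Unicode data shipped with the interpreter, which is an assumption here and not the one-to-one Unicode 3.0 table mentioned above.

```python
import unicodedata

def nfc_lower(name: str) -> str:
    """Normalise to form C, then lower-case (the simplest rule)."""
    return unicodedata.normalize("NFC", name).lower()

def same_name_simple(a: str, b: str) -> bool:
    """Compare two names after NFC normalisation and lower-casing."""
    return nfc_lower(a) == nfc_lower(b)

# Case and canonical differences disappear:
print(same_name_simple("Example", "EXAMPLE"))        # ASCII case only
print(same_name_simple("caf\u00e9", "cafe\u0301"))   # precomposed vs combining accent

# Compatibility variants stay distinct under form C:
print(same_name_simple("\uff41bc", "abc"))           # fullwidth "a" vs ASCII "a"
print(same_name_simple("\ufb01sh", "fish"))          # "fi" ligature vs "f" + "i"
```

The first two comparisons succeed and the last two fail, which is the kind of cosmetic difference form KC would remove.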
I removed form KC from my draft because Harald Alvestrand said it was
difficult. But maybe that is wrong?
Can you convert from form C to form KC without going over a decomposing
I prefer to say form C everywhere (and in original case), because it
allows cosmetic differences to be preserved. And that is important for
many people.
Still it would be nice to say: when doing a lookup, the data should
be normalised from form C to KC, lower cased (or maybe this before
normalising) and maybe simplified further before
doing the comparison.
But can this be done easily? And are we ready to define the
matching rules so they need not be changed tomorrow?
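As a rough sketch of such a lookup key, and of why the order of the two steps matters, consider the following Python. Both function names are hypothetical, and str.lower() here uses Python's own Unicode data, which is an assumption and not the Unicode 3.0 table. Neither order is claimed to be a complete matching rule.

```python
import unicodedata

def lookup_key(name: str) -> str:
    """Lower-case first, then normalise to form KC."""
    return unicodedata.normalize("NFKC", name.lower())

def lookup_key_normalise_first(name: str) -> str:
    """Normalise to form KC first, then lower-case."""
    return unicodedata.normalize("NFKC", name).lower()

# Compatibility variants collapse onto the plain ASCII name.
print(lookup_key("\uff21BC") == "abc")    # fullwidth "A" followed by "BC"
print(lookup_key("\ufb01sh") == "fish")   # "fi" ligature

# "J" + COMBINING CARON has no precomposed capital, but the lower-case
# form does (U+01F0). Lower-casing after normalising therefore leaves
# a string that is no longer in normal form.
sample = "J\u030c"
after = lookup_key_normalise_first(sample)
print(after == unicodedata.normalize("NFKC", after))   # False: not normalised
print(lookup_key(sample) == "\u01f0")                  # True: composed form
```

This matches the correction quoted at the top: a case change can undo normalisation, so the normalisation step has to come after it (or be repeated).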
There might be one more problem: more systems than DNS may need to
compare domain names. For example, a web browser needs to compare
domain names to know if two URLs point to the same place. With
form KC and other rules, all of that needs to go into every program
that compares domain names.
Though for many programs just handling the folding to lower case
would be enough.
In my draft I gave a way to map the internal UTF-8 encoded data
in DNS to ASCII (or a larger subset of UCS) for all characters not
handled by the subset. This was done for two reasons:
1) To allow users to see and enter names not supported by their
local character set.
2) To allow "old DNS" to get ASCII only names that also can work
with old SMTP and other protocols.
Here Kent said:
>1) "The" local character set? Whose local character set? Such things
>are these days usually personal preferences, or rather just personal
>defaults swiftly overridden by "charset=", heuristics, or plain temporary
>change. Setting one non-UCS character encoding as "the" local one for
>entire organisation or the like is highly inappropriate.
The local character set is the one the user is using. If the user
is in locale ASCII, it will be used. If the user is in ISO 8859-15
it will be used. But of course, just like you have today, you have
a problem if a person in one locale gives the name to a person in
another locale without remapping it to fit that locale.
I did not think that you should set one for the entire
organisation. And I do not think the problem is easy.
While the user using ISO 8859-15 wants domain names displayed using
ISO 8859-15, if possible, when sending e-mail the domain names must
either be in ASCII (using old SMTP) or UTF-8 (using a future SMTP).
Even if we say that a person using ISO 8859-15 may not need (or want)
to use a domain name in Korean (such names are in Korean and
cannot be displayed) and only needs domain names in ISO 8859-15, we
still need a way for domain names from DNS to be translated into
ISO 8859-15 and for domain names sent in e-mail to be translated to
what is supported by the e-mail system.
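A small Python sketch of that translation step: check whether a name fits the user's character set, and fall back to an escape for characters that do not. The function names are hypothetical, and the "backslashreplace" escape is just Python's built-in error handler, used here as a stand-in and not the fallback syntax of my draft.

```python
def fits_charset(name: str, charset: str) -> bool:
    """Return True if every character of the name exists in the charset."""
    try:
        name.encode(charset)
        return True
    except UnicodeEncodeError:
        return False

def to_local_charset(name: str, charset: str) -> bytes:
    """Encode for display, escaping characters the charset lacks."""
    return name.encode(charset, errors="backslashreplace")

swedish = "r\u00e4ksm\u00f6rg\u00e5s"   # fits ISO 8859-15
korean = "\ud55c\uae00"                 # Hangul, not in ISO 8859-15

print(fits_charset(swedish, "iso-8859-15"))
print(fits_charset(korean, "iso-8859-15"))
print(to_local_charset(korean, "iso-8859-15"))
```

The Swedish name encodes directly, while the Korean one only survives as escapes, which is the case the fallback mapping is meant to handle.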
>2) Though I'm slightly, but only very slightly, more sympathetic to
>"say the catalogue number (in hex)" type of fallbacks (than CIDNUC-like),
>it should really be the UCS "catalogue number" (like HTML/XML/modern
>SGML does), and NOT be tied to any UTF or other encoding.
My mapping is UCS, though I gave it as a way to encode the UTF-8 encoding
of UCS. Just to make things easy and be a very direct mapping of the
internal format of DNS (in my draft). I could take the code value
of the UCS character instead and then define a way to map that into
ASCII (for example -number in decimal-) but that might take more space
and might make the mapping more complex. I wanted a simple quick
mapping that works both in coding down to ASCII and to other character
sets, and could be used during the transition from ASCII to
UCS in the protocols.
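To make the two options concrete, here is a sketch in Python that escapes non-ASCII characters either by the hex of their UTF-8 bytes or by the UCS code value in decimal. The "-...-" delimiters and both function names are assumptions for illustration only; this is not the exact syntax of my draft, and as written the escapes are not unambiguously reversible (a name may already contain hyphens).

```python
def escape_utf8_bytes(name: str) -> str:
    """Replace each non-ASCII character with -<hex of its UTF-8 bytes>-."""
    out = []
    for ch in name:
        if ord(ch) < 0x80:
            out.append(ch)
        else:
            out.append("-" + ch.encode("utf-8").hex() + "-")
    return "".join(out)

def escape_code_point(name: str) -> str:
    """Replace each non-ASCII character with -<UCS code value in decimal>-."""
    out = []
    for ch in name:
        if ord(ch) < 0x80:
            out.append(ch)
        else:
            out.append("-" + str(ord(ch)) + "-")
    return "".join(out)

for name in ("r\u00e4ksm\u00f6rg\u00e5s", "\ud55c"):
    print(escape_utf8_bytes(name), escape_code_point(name))
```

Which form is shorter depends on the characters involved; the byte-based form has the advantage of mirroring the UTF-8 data stored in DNS directly.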
It would have been nice if we could just let the old DNS software
get the UTF-8 encoded names and let the work to handle the non-ASCII
be up to it, but I have the feeling that we have to help the old
software by making DNS return ASCII only names, when old DNS is used.