
Re: [idn] Comments on protocol drafts



At 09:27 00/02/07 -0500, John C Klensin wrote:
> Mangus (and others),
> 
> Since I've been pulled into a formal management role with this 
> proto-WG ("technical advisor", or something),

Glad to have you here. Who 'pulled' you?



> (ii) We believe that some set of languages, or character sets, 
> _deserve_ to be treated very badly by the encoding scheme, just 
> because they came along a little late in the Unicode design 
> process (or for worse reasons).  If you preserve "ASCII is 
> always coded the same way, and that way is always as short as 
> possible" by, e.g., using UTF-8 always, then you are saying, in 
> essence, that Roman-derived alphabets are so important that it 
> is just as reasonable to always force, e.g., CJK characters 
> --even for extremely common words-- in three or four (or worse) 
> octet codings.

Please don't use CJK as the main example. CJK text uses two bytes
per character all the time anyway, so using 3 (UTF-8) or 4 (UTF-5)
or so isn't that big a hit. And label lengths, in terms of characters,
are going to be much smaller for CJK than for alphabetic
scripts. The main problem cases are scripts such as Devanagari,
Bengali, Tamil, Georgian,... which are alphabetic but require
3 bytes per character in UTF-8. And even for those, the label length
limit of 64 bytes in DNS is not a significant restriction, although
labels such as
   ThisIsAVeryLongLabelThatIsReallySixtyFourBytesLongToShowThePoint
   0123456789012345678901234567890123456789012345678901234567890123
won't be possible. (sorry if the actual net label length was 63)
(and then there may be Klingon, which, if encoded, will end up
in plane 1 or so, with 4 bytes per character in UTF-8 :-).
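
To make the byte counts above concrete, here is a small sketch in Python.
The sample characters and the helper names (utf8_len, max_chars) are my
own choices for illustration, not taken from any draft. The limit is a
parameter: RFC 1035 gives 63 octets per label, and 64 is included only
to match the figure used in this message.

```python
# Sketch: UTF-8 bytes per character for a few scripts, and how many
# such characters fit in one DNS label. Sample characters are
# illustrative picks, one per script.

def utf8_len(text: str) -> int:
    """Number of bytes needed to encode text in UTF-8."""
    return len(text.encode("utf-8"))


def max_chars(bytes_per_char: int, label_limit: int = 63) -> int:
    """Whole characters of a fixed byte width that fit in one label."""
    return label_limit // bytes_per_char


SAMPLES = {
    "Latin (ASCII)": "a",        # U+0061
    "Devanagari": "\u0915",      # DEVANAGARI LETTER KA
    "Bengali": "\u0995",         # BENGALI LETTER KA
    "Tamil": "\u0b95",           # TAMIL LETTER KA
    "Georgian": "\u10d0",        # GEORGIAN LETTER AN
    "CJK": "\u4e2d",             # CJK ideograph U+4E2D
    "Plane 1": "\U00010400",     # a plane 1 character (Deseret)
}

if __name__ == "__main__":
    for script, ch in SAMPLES.items():
        width = utf8_len(ch)
        print(f"{script:14} {width} byte(s)/char, "
              f"{max_chars(width)} chars per 63-byte label")
```

With a 63-byte limit, 3-byte characters allow 21 characters per label
and 4-byte characters allow 15; with 64 bytes the figures are 21 and 16.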

When I wrote the first i18n dns draft, I contacted some acquaintances
in Georgia and Iceland (known for long words), and they said it wouldn't
be a problem. Probably more research is needed, just to be sure.
Maybe somebody could figure out how many registrations there are
e.g. below .com that have label lengths >21 (for 3 bytes per char)
or >16 (for 4 bytes per char), and what they are.
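
The survey suggested above could be sketched roughly as follows. This is
only an outline under assumptions: the input is some list of registered
names (I don't know of a particular source for one), and the function
name long_labels is made up for the example.

```python
# Sketch: find registered labels directly below a TLD that are longer
# than a given number of characters. The input list is hypothetical.

def long_labels(names, min_chars, tld="com"):
    """Return labels directly below `tld` that exceed min_chars characters."""
    found = []
    for name in names:
        parts = name.strip().lower().rstrip(".").split(".")
        # Only names of the form <label>.<tld> or <sub>.<label>.<tld>
        if len(parts) >= 2 and parts[-1] == tld:
            label = parts[-2]
            if len(label) > min_chars:
                found.append(label)
    return found


if __name__ == "__main__":
    # Made-up names, just to show the call
    names = ["example.com", "a" * 22 + ".com", "b" * 17 + ".com"]
    print(len(long_labels(names, 21)), "labels longer than 21 chars")
    print(len(long_labels(names, 16)), "labels longer than 16 chars")
```

Labels found with the threshold 21 would not fit at 3 bytes per
character, and those found with 16 would not fit at 4 bytes per character.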



> If the test fails, then it seems to me that one is essentially 
> making a "my language is more important than his language, and 
> will be forever" argument.   Again, that might be a plausible 
> argument (although I'd be embarrassed to make it in public), 
> but, if people want to make it, let's not try to hide it 
> (however unintentionally) behind "minimal or constant codings 
> for ASCII" arguments.

I think one argument for still having ASCII be shorter is that
a lot of domains will have an ASCII equivalent domain name.


Regards,   Martin.


#-#-#  Martin J. Du"rst, World Wide Web Consortium
#-#-#  mailto:duerst@w3.org   http://www.w3.org