
Re: [idn] What's wrong with skwan-utf8?



--On Friday, 29 December, 2000 12:02 -0500 Edmon
<edmon@neteka.com> wrote:

> Does anyone have a good list of "what happens" if a full
> 8-bit name is received by the different applications?
> The reason we should strive to compile this list is that once
> IDN is deployed, there will be people trying to enter
> multilingual domain names from non-compliant applications.
> There is simply no way of avoiding this, even if we choose to
> go with ACE.
> 
> If something is bound to break or go down, we should know
> about it beforehand so that we can prepare for it.
 
Edmon,

Murphy's Law is really the operant principle here.  If something
significant on which other things depend is changed, some of
those things will respond in disruptive ways.   Indeed, even if
the original software and systems were clearly broken, things
will have been written to adapt to the broken behavior and some
of them will behave poorly when the problem is corrected.
Making lists of software is almost pointless unless one is
willing to say "ok, this will cover 75% of the cases, and it is
just tough luck for the other 25%" or "anything that is broken
will provide incentives to its authors for fixing it quickly and
that might even be a good thing" (I've heard arguments close to
both of those on this list, although I find it hard to
sympathize with them in the case of the DNS).

In the specific case of "8 bit" names, experience indicates that
one of several things will happen:

	(i) Everything will, as Dan has pointed out, work fine.
	
	(ii) The application will discard the high-order bits,
	resulting in a mess.
	
	(iii) The application will convert all "unknown" or
	"invalid" octets (i.e., those with the high bit on or
	those others not permitted by the application in domain
	names) to a single "noise" character.
	
	(iv) The application will produce an "invalid name"
	error message.
	
	(v) The application will treat the octets as arising
	from one of the ISO 8859-NN character sets (which one
	will depend on the application, although 8859-1 is most
	common), rather than as a UTF-8 encoding of ISO 10646/
	Unicode.
	
	(vi) The "invalid" characters will cause the application
	to go down some insufficiently-tested or
	insufficiently-robust code path, resulting in its
	blowing up or fatally crashing.

Note that several of the above can lead, in especially unlucky
cases, to finding a name other than the one the user intended
and hence to returning results that represent a different host
or resource than the one desired.  That type of error, which is
hard to detect, may, from a user or resource-provider
standpoint, be worse than any of the other erroneous outcomes.
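
As a rough illustration of cases (ii) through (v), and of how two
different names can collapse into one, here is a minimal Python
sketch.  The example names and the helper functions are made up
for this purpose; they model the behaviors described above and
are not taken from any particular application.

```python
# Hypothetical example names; \u00fc is u-umlaut, \u00e4 is a-umlaut.
NAME_A = "b\u00fccher.example"
NAME_B = "b\u00e4cher.example"


def strip_high_bit(octets: bytes) -> bytes:
    """Case (ii): discard the high-order bit of every octet."""
    return bytes(b & 0x7F for b in octets)


def replace_with_noise(octets: bytes, noise: int = ord("?")) -> bytes:
    """Case (iii): map every octet with the high bit on to one noise character."""
    return bytes(b if b < 0x80 else noise for b in octets)


def reject_if_not_ascii(octets: bytes) -> bytes:
    """Case (iv): refuse any name containing an octet with the high bit on."""
    if any(b >= 0x80 for b in octets):
        raise ValueError("invalid name")
    return octets


def misread_as_latin1(octets: bytes) -> str:
    """Case (v): interpret the octets as ISO 8859-1 instead of UTF-8."""
    return octets.decode("iso-8859-1")


wire_a = NAME_A.encode("utf-8")   # b'b\xc3\xbccher.example'
wire_b = NAME_B.encode("utf-8")   # b'b\xc3\xa4cher.example'

print(strip_high_bit(wire_a))       # b'bC<cher.example'
print(replace_with_noise(wire_a))   # b'b??cher.example'
print(misread_as_latin1(wire_a))    # two Latin-1 characters where one was meant

# Two distinct names become indistinguishable after case (iii).
print(replace_with_noise(wire_a) == replace_with_noise(wire_b))   # True
```

The last line is the hard-to-detect failure: after the noise
substitution both names are the same string, so a lookup could
proceed without any error being reported.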

Of course, most of these same comments would apply to direct
(not further encoded) use of UCS-4.

We presume, but cannot prove, that seven-bit-only encodings will
not have these problems because they will not trigger behavior
based on the high order bit being unexpectedly turned on.  Some
of the same issues (especially variants on iii, iv, or vi) could
still occur with ACE-like encodings, either because of incorrect
code or because of actions taken to avoid displaying characters
that are not within the capabilities of local devices.
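
To make the high-order-bit point concrete, here is a small Python
sketch that compares raw UTF-8 octets with a toy seven-bit
encoding.  The toy_ace function and its "zz--" prefix are
invented for illustration (base32 over the UTF-8 octets); they
are not any of the ACE proposals under discussion.

```python
import base64


def toy_ace(label: str) -> str:
    """Hypothetical ACE-like encoding: made-up prefix plus base32 of the UTF-8 octets."""
    encoded = base64.b32encode(label.encode("utf-8")).decode("ascii")
    return "zz--" + encoded.rstrip("=").lower()


def has_high_bit(octets: bytes) -> bool:
    """True if any octet has its high-order bit on."""
    return any(b >= 0x80 for b in octets)


LABEL = "b\u00fccher"   # hypothetical label; \u00fc is u-umlaut

raw = LABEL.encode("utf-8")
ace = toy_ace(LABEL).encode("ascii")

print(has_high_bit(raw))   # True
print(has_high_bit(ace))   # False
```

The encoded form never sets the high bit, so code paths keyed on
that bit are not triggered.  It says nothing about applications
that reject or mangle names for other reasons.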

    john