
Re: [idn] I-D ACTION:draft-ietf-idn-idna-08.txt



--On Thursday, 30 May, 2002 19:55 -0500 "Eric A. Hall"
<ehall@ehsco.com> wrote:

>...
> That's one way to read it. Another way to read it is that all
> of the codepoints are opaque, except that ASCII ranges A-Z and
> a-z are to be treated as case-neutral for the purposes of
> comparison ONLY. Which is, in fact, exactly what RFC 1035 says.

Eric,

In the interest of making progress in this discussion (if that
is possible at all), can I ask that you, and others, try to
avoid statements like "exactly what RFC 1035 says".  Many of us,
even among those who believe that RFC 2181 settled the issue
definitively, believe that 1035 is a bit unclear.  And, even
when I suspend disbelief and accept that 1035 intended "binary
labels" to be accepted in any RR, I'd suggest that your
interpretation is not obvious from 1035.

I would suggest, instead, that you would need to define three
types of labels (a comparison sketch in C follows the list):

	* ASCII-only, interpreted with case-insensitive matching
	
	* Character, but not necessarily ASCII, interpreted with
	case-insensitive matching in the 0x00 to 0x7F range and
	as exact matches outside that range.
	
	* Binary, which had best not be subject to
	case-insensitive comparisons because they might be,
	e.g., unsigned numbers.  I think we can agree that 1035
	gives little or no advice about how to interpret such
	labels, although it is clear that they can exist, at
	least for as-yet-undefined RR types or Classes.
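
To make the distinction concrete, here is a rough C sketch of how
those three interpretations might drive a comparison routine.  The
enum and the function name are mine, invented purely for
illustration; nothing like this appears in 1034/1035.

    #include <stddef.h>
    #include <ctype.h>

    /* Hypothetical three-way classification; RFC 1035 defines no
     * such tag, which is exactly the problem discussed below. */
    enum label_kind { LABEL_ASCII, LABEL_CHARACTER, LABEL_BINARY };

    /* Compare two labels of equal length under one of the three
     * interpretations.  Returns nonzero if they match.
     * NB: tolower() is locale-sensitive; a strict implementation
     * would fold only A-Z, as in a later sketch in this message. */
    static int labels_match(const unsigned char *a,
                            const unsigned char *b,
                            size_t len, enum label_kind kind)
    {
        size_t i;
        for (i = 0; i < len; i++) {
            switch (kind) {
            case LABEL_ASCII:
                /* all-ASCII: fold case everywhere */
                if (tolower(a[i]) != tolower(b[i]))
                    return 0;
                break;
            case LABEL_CHARACTER:
                /* fold case in 0x00-0x7F; exact match above 0x7F */
                if (a[i] < 0x80 && b[i] < 0x80) {
                    if (tolower(a[i]) != tolower(b[i]))
                        return 0;
                } else if (a[i] != b[i]) {
                    return 0;
                }
                break;
            case LABEL_BINARY:
                /* opaque octets: exact match only */
                if (a[i] != b[i])
                    return 0;
                break;
            }
        }
        return 1;
    }

Everything that follows turns on where that "kind" argument would
come from.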

My problem is that I don't know how to tell the three cases
apart (except by looking at RR type or Class definitions).  If
you believe that only the first two cases exist (which is what
your comments seem to imply), then the first is a special case
of the second and all is well... except that 1034 and 1035
clearly (at least to me) provide for binary content, which would
not be subject to case-insensitive interpretation.

If you believe, as I do, that only the first and third cases
exist, then there is still a problem, since a binary label could
reasonably have the high (eighth) bit of every octet turned off,
making it indistinguishable from the ASCII ones.
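
A trivial illustration of that indistinguishability (the values
are mine, chosen for effect):

    #include <stdio.h>

    int main(void)
    {
        /* A "binary" label carrying the four-octet unsigned number
         * 0x43415345 is octet-for-octet identical on the wire to
         * the ASCII label "CASE"; nothing in the label itself says
         * which interpretation was intended. */
        unsigned char binary_label[4] = { 0x43, 0x41, 0x53, 0x45 };
        printf("%.4s\n", (const char *)binary_label); /* prints: CASE */
        return 0;
    }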

And that is, it seems to me, the solution to the dilemma that
Dan poses.  Regardless of whether it would be a good idea, it
would be plausible for IETF to come along at this point and say
something much more definitive than 2181 does.  One form of that
might be:

	Heretofore, as defined by at least one reading of RFC
	1034/1035, the labels associated with A, MX, TXT, SRV,
	etc., RRs in Class IN consisted of ASCII characters,
	with labels containing octets outside the ASCII range
	being treated as undefined (note, the label, not the
	octet) and their use being non-standard.   Henceforth,
	octets with values greater than 0x7F are permitted in
	labels for those RR types.  Octets with values from 0x00
	to 0x7F are to be compared as specified in RFC 1035,
	i.e., case-independently for ASCII graphic characters.
	Octets with values from 0x80 to 0xFF are to be compared
	directly, i.e., without any mapping or translation.
	Such labels, and the associated RRs, will be described,
	for convenience, as "character labels" and "character
	RRs".
	
	Future RRs and Classes may be defined so that their
	labels are binary, i.e., that they are to be interpreted
	without consideration of the content or possible
	mappings of any octet.  New RRs MUST indicate whether
	their labels are "character" or "binary"; new Classes
	MUST specify some appropriate rule, i.e., either
	specification on a per-label basis or a single rule for 
	the Class.
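
In code, the "character label" comparison that text calls for
might look something like this (my sketch, not taken from any real
resolver).  Note that the fold is deliberately confined to A-Z,
matching the way 1035 scopes case-insensitivity, rather than
relying on a locale-sensitive tolower():

    #include <stddef.h>

    /* Fold one octet per the hypothetical rule above: ASCII
     * uppercase letters map to lowercase; every other value,
     * including 0x80-0xFF, passes through untouched. */
    static unsigned char fold_octet(unsigned char c)
    {
        return (c >= 'A' && c <= 'Z') ? (unsigned char)(c + 0x20) : c;
    }

    /* Compare two equal-length "character labels"; returns
     * nonzero if they match under the rule. */
    static int character_labels_match(const unsigned char *a,
                                      const unsigned char *b,
                                      size_t len)
    {
        size_t i;
        for (i = 0; i < len; i++)
            if (fold_octet(a[i]) != fold_octet(b[i]))
                return 0;
        return 1;
    }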

Personally, I don't particularly like the above rule, since it
would give us case-independence for ASCII, but case-dependency
for all other scripts and languages that make distinctions.
That strikes me as a really bad idea, one that would cause far
more trouble in the future.  But it is a decision, and a change,
that the IETF could logically make.

More generally, if we are trying to _change_ a rule, or specify
a rule where none existed, I think we need to do two things.
The first is to be sure that our change is compatible and
acceptable, either by identifying the new case as different from
the old (such as with a label) or by assuring ourselves that the
previous behavior was undefined and non-standard and that we can
therefore safely specify a behavior.   The latter probably
requires looking at usage, not just text of documents, but it
should still be possible and, as Dan points out, we have done it
in the past.

Second, we should be very clear that we are making a change.  We
should avoid replacing ambiguous or contradictory text with new
text that claims to clarify or not change anything but that,
intentionally or not, involves tricky reading that later permits
people to come back and say "well, we got this into Proposed
Standard by claiming it wasn't a change, but it now specifies
what was previously unspecified".  If nothing else, the latter
approach tends to lead to a very poor level of specification,
such as whether a label that contains octets in both the
0x00-0x7F and 0x80-0xFF ranges is to be interpreted as "binary"
or with different comparison rules on a per-octet basis.
Similarly, we should avoid pretending that cases that were never
anticipated, and that have odd effects at the margins, were
really what was intended all along.

FWIW, there is also a small performance issue here, which people
seem to be ignoring.  If I know a label is all ASCII graphics
and is to be compared case-insensitively, I can mask the case
bit (0x20) out of every octet (on an atomic full-string basis on
some hardware) and then do a full-string compare (again, in
hardware on many machines).  If I know it is "binary", then I
can skip the masking and do a full-string compare.   But, if some
octets are to be treated as ASCII, and some as binary, I need to
construct a loop and examine each octet in turn to figure out how
to compare it.  Not a good idea if it can be avoided.
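
A sketch of the two paths (my code, with the usual caveat that
the 0x20 mask trick is only safe for labels known to contain
hostname-legal octets, where no two distinct legal values collide
after masking):

    #include <stddef.h>
    #include <string.h>

    /* Fast path: label known to be all hostname-legal ASCII.
     * Clear bit 0x20 in every octet -- word-at-a-time on suitable
     * hardware, shown bytewise here -- then compare the whole
     * buffer.  Safe because 'a'..'z' fold onto 'A'..'Z' and no
     * other hostname-legal pair collides after the mask. */
    static int ascii_labels_match_fast(const unsigned char *a,
                                       const unsigned char *b,
                                       size_t len)
    {
        unsigned char fa[63], fb[63];  /* DNS labels are <= 63 octets */
        size_t i;
        if (len > 63)
            return 0;
        for (i = 0; i < len; i++) {
            fa[i] = a[i] & 0xDF;       /* clear the case bit */
            fb[i] = b[i] & 0xDF;
        }
        return memcmp(fa, fb, len) == 0;
    }

    /* Slow path: mixed interpretation forces the per-octet loop
     * described above, classifying each position as it goes. */
    static int mixed_labels_match(const unsigned char *a,
                                  const unsigned char *b,
                                  size_t len)
    {
        size_t i;
        for (i = 0; i < len; i++) {
            unsigned char x = a[i], y = b[i];
            if (x >= 'A' && x <= 'Z') x += 0x20;  /* fold ASCII case */
            if (y >= 'A' && y <= 'Z') y += 0x20;
            if (x != y)               /* octets above 0x7F, and all  */
                return 0;             /* other values, match exactly */
        }
        return 1;
    }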

    john