
Re: [idn] I-D ACTION:draft-ietf-idn-idna-08.txt



"Eric A. Hall" <ehall@ehsco.com> wrote:

> The i18n namespace is case-sensitive because of the AMC-Z encoding,
> not because of nameprep.  The original capitalization has to be burned
> to suit the encoding.
>
> As a result, all i18n domain names (unencoded) must be compared as
> case-specific data, by requirement of the codec.

You have totally lost me.  First, what does "case-specific" mean?

I'll start by explaining exactly what "case-sensitive" and
"case-insensitive" mean (as I understand the terms).  To say that a
text string is "case-insensitive" is to say that changing characters
from upper to lower case or vice-versa does not change the identity
of the string; in other words, case differences are ignored when the
string is compared to other strings.  To say that a text string is
"case-sensitive" is to say that changing characters from upper to lower
case or vice-versa does change the identity of the string; in other
words, case differences are not ignored when the string is compared.
Neither term says anything about whether the string is capable or
incapable of preserving mixed-case text; that is an orthogonal question.

So there are four possibilities:  A string can be case-sensitive
and case-preserving (like Unix file names), or case-insensitive
and case-preserving (like Macintosh and Amiga file names), or
case-insensitive and non-case-preserving (like MS-DOS file names), or
case-sensitive and non-case-preserving (I can't think of any real-world
examples).
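
A minimal sketch of that orthogonality in Python.  The function names
same_name and stored_form are made up for illustration, and the folding
is ASCII-style only; neither is taken from any real file system.

```python
def same_name(a: str, b: str, case_sensitive: bool) -> bool:
    """Comparison rule: do two spellings denote the same name?"""
    if case_sensitive:
        return a == b
    # Simple lower-casing is enough for this ASCII-only illustration.
    return a.lower() == b.lower()


def stored_form(name: str, case_preserving: bool) -> str:
    """Storage rule: which spelling survives once the name is recorded?"""
    # A non-preserving store is modelled here as forcing upper case.
    return name if case_preserving else name.upper()


# The two flags vary independently, giving the four combinations.
for case_sensitive in (True, False):
    for case_preserving in (True, False):
        print(case_sensitive, case_preserving,
              same_name("ReadMe", "README", case_sensitive),
              stored_form("ReadMe", case_preserving))
```

The comparison result depends only on the first flag and the stored
spelling only on the second.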

For IDNA, we wanted the mapping between non-ASCII labels and ACE labels
to have the following property:  A case-insensitive comparison of
two ACE labels always returns the same answer as a case-insensitive
comparison of the two corresponding non-ASCII labels.  In order
to achieve that property, a case-folding step is essential in the
definition of the ACE mapping.  (Maybe you don't want that property, but
it was a fundamental design goal of IDNA, in order to avoid surprising
users with different comparison rules for ASCII versus non-ASCII names.)
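
The property can be spot-checked with Python's built-in "idna" codec,
which implements the published IDNA specification (RFC 3490).  This is
only a sketch:  the helper names are hypothetical, str.casefold() is
used as a stand-in for a case-insensitive comparison of the non-ASCII
labels, and three sample pairs prove nothing in general.

```python
def ace_equal(a: str, b: str) -> bool:
    """Case-insensitive comparison of the ACE forms of two labels."""
    return a.encode("idna").lower() == b.encode("idna").lower()


def unicode_equal(a: str, b: str) -> bool:
    """Case-insensitive comparison of the labels themselves (assumed rule)."""
    return a.casefold() == b.casefold()


PAIRS = [
    ("f\u00f6o", "F\u00d6O"),   # föo vs FÖO: same label up to case
    ("f\u00f6o", "foo"),        # föo vs foo: different labels
    ("foo", "FOO"),             # plain ASCII labels
]

for a, b in PAIRS:
    # Both comparisons should agree for every pair.
    print(a, b, ace_equal(a, b), unicode_equal(a, b))
```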

The codec defined by IDNA has a number of steps, of which the two
biggest are Nameprep and Punycode, but it's all a single codec.
Punycode is broken out as a separate step because its implementation
is independent of the other steps and because Punycode might be useful
for things other than domain names.  But for IDNs, Punycode is not the
codec, it is merely one step of the codec, and Nameprep is another step
of the codec.
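
To make the "single codec, several steps" point concrete, here is a
simplified sketch using pieces from Python's standard library.  It
assumes the nameprep() helper in the encodings.idna module and the
"punycode" codec, and it uses the "xn--" prefix of the published
specification (this message writes the prefix as "xx--").  The function
name to_ace_by_steps is hypothetical, and the length and STD3 checks of
the real ToASCII are omitted.

```python
from encodings import idna

ACE_PREFIX = b"xn--"   # prefix from the published IDNA spec (RFC 3490)


def to_ace_by_steps(label: str) -> bytes:
    """Simplified ToASCII: Nameprep, then Punycode, then the ACE prefix."""
    if label.isascii():
        # All-ASCII labels are passed through without Nameprep.
        return label.encode("ascii")
    prepared = idna.nameprep(label)        # case folding + normalization
    if prepared.isascii():
        return prepared.encode("ascii")
    return ACE_PREFIX + prepared.encode("punycode")


print(to_ace_by_steps("F\u00d6O"))         # FÖO -> b'xn--fo-fka'
print("F\u00d6O".encode("idna"))           # the built-in codec agrees
```

Punycode alone never sees the capital letters here, because Nameprep has
already folded them.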

IDNA nowhere suggests the use of case-sensitive comparisons between
domain names in any form.

> If somebody needs an RR that preserves case, there's no reason they
> shouldn't be able to do so.

Agreed.  And I suggest two ways they might do this:  (1) Define a
case-sensitive data format and don't call it a domain name.  (2) Define
a mapping from non-ASCII domain labels to ASCII domain labels that
doesn't involve case-folding.  But don't call this mapping IDNA, because
it's not.  It might look very similar to IDNA, and might reuse pieces of
it, but it shouldn't use the IDNA ACE prefix for a mapping that is not
the IDNA ACE mapping.

> Labels are currently stored, transferred and compared as
> octet-streams, with the exception being that ASCII A-z is compared as
> case-insensitive.

That's how most DNS servers handle 8-bit labels in practice, but I
still don't think they're required to do so; I just think it's the best
effort they can make given that they don't know the charset.  But we've
heard reports of some new DNS servers that try to guess the charset,
in which case they might then do case-insensitive comparisons even for
the non-ASCII characters.  And there's still the wide world of entities
other than DNS servers, which also compare domain names, and their
handling of 8-bit names is even less predictable.
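
A small sketch of that octet-stream behaviour in Python.  The function
name octet_compare is made up, and UTF-8 is assumed for the 8-bit labels
purely for the sake of the example (the server itself would not know the
charset).  Python's bytes.lower() changes only ASCII letters, which is
the comparison rule being described.

```python
def octet_compare(a: bytes, b: bytes) -> bool:
    """Compare labels as octets, folding case for ASCII letters only."""
    return a.lower() == b.lower()


lower_label = "f\u00f6o".encode("utf-8")   # föo as raw octets
upper_label = "F\u00d6O".encode("utf-8")   # FÖO as raw octets

print(octet_compare(b"foo", b"FOO"))            # True: ASCII is folded
print(octet_compare(lower_label, upper_label))  # False: ö and Ö stay distinct
```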

> you are arguing for mandatory lowercasing for storage and transfer in
> addition to comparison.

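Yes, because that's the only way to achieve the property I mentioned
above:  case-insensitive comparison of ACE labels giving the same answer
as case-insensitive comparison of the corresponding non-ASCII labels.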

> > Now consider an entity that knows that föo and FÖO and xx--fo-fka
> > and xx--FO-ohA are domain labels, but does not know that they are
> > special labels that don't use Nameprep.
>
> Why would it ask for a special RR that it doesn't know how to read?

Because it might be a caching DNS server.

If you only care about the end applications, which know the special
semantics of the special labels, then just use a different prefix to
go with your different Stringprep profile.  Then you can be sure that
entities that know IDNA but don't know about your special labels won't
accidentally muck with them.

> > Nameprep can of course produce identical output for two distinct
> > inputs.  But for two distinct outputs of Nameprep, ToASCII cannot
> > produce the same output.
>
> Is the encoding form guaranteed to always be reversible to the original
> capitalization?

I'm not sure what you mean by "encoding form".  The ACE form (which
involves both Nameprep and Punycode) is not guaranteed to be reversible
to the original capitalization.  There is a mechanism called "mixed-case
annotation", described in Appendix B of the Punycode spec, which can
remember any capitalization you want (except for titlecase), but the
ToASCII and ToUnicode operations described in the IDNA spec neither
create nor apply these annotations.  It's possible to write a conformant
ToASCII that creates them, and a conformant ToUnicode that applies them,
but the IDNA spec does not mention this possibility, mainly because the
first two authors have never been convinced that it's really safe or
effective.  :)  (I originally proposed this mechanism as a way to make
IDNs case-preserving like ASCII domain names are, but I long ago gave up
on that issue.)
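
The loss of capitalization is easy to observe with Python's built-in
"idna" codec, which follows the published specification and, as far as
I know, does not create or apply mixed-case annotations.  A short
sketch (the variable names are arbitrary):

```python
original = "F\u00d6O"                 # FÖO

ace = original.encode("idna")         # Nameprep folds case, then Punycode
restored = ace.decode("idna")         # decode the ACE label again

print(ace)                            # b'xn--fo-fka'
print(restored == original)           # False: the capitals are gone
print(restored == "f\u00f6o")         # True: only the folded form comes back
```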

There is no analogous mechanism for recovering a non-normalized string.

Punycode by itself can encode any arbitrary sequence of non-negative
integers, and always decodes to exactly the same sequence.
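
By contrast, a sketch of the Punycode step on its own, using Python's
"punycode" codec.  That codec works on strings of code points rather
than arbitrary integer sequences, so it only illustrates the round trip
for text; the helper name punycode_roundtrip is hypothetical.

```python
def punycode_roundtrip(text: str) -> str:
    """Encode with Punycode alone (no Nameprep), then decode again."""
    return text.encode("punycode").decode("punycode")


for text in ("F\u00d6O", "f\u00f6o"):      # FÖO and föo
    encoded = text.encode("punycode")
    # Without Nameprep the two spellings get different encodings,
    # and each decodes back to exactly what went in.
    print(encoded, punycode_roundtrip(text) == text)
```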

AMC