
Re: [idn] IDNA questions



Erik Nordmark <Erik.Nordmark@sun.com> wrote:

> For some code points (like unassigned, private use, non-character, and
> surrogate code points) it seems to make sense to prohibit them in both
> IHL and INHL.

Agreed.

> But why not take the same approach as for ASCII and allow space
> characters, control characters, etc. in INHL?

Good question.  I wouldn't mind seeing those characters allowed in
non-host IDNs.

> The above quote from the IDNA specification with the possible
> interpretations of "protocol" doesn't provide much guidance to
> registries on what code points can be safely registered, i.e., do not
> risk being declared "not a host name" in some later IETF standard.
> 
> I think there can be two different intents here:
> 1. The intent is that all restrictions be solely by registries ("name-
>    handling bodies") based on their local policies of what to register.
>    No application protocols will restrict the set of allowed code points
>    beyond what is currently specified in IDNA+nameprep+stringprep.
> 2. The intent is at a later date the IETF will produce a specification
>    for host name syntax (either for specific application protocols
>    or in general) that will place some additional restrictions.

The reason the intent is unclear is that the IDNA authors never agreed
on the intent, so we agreed on a document that abstained from expressing
the intent.

I personally would like to see #2.  I think the two-syntax model has
served us well.  Today's DNS supports an anything-goes syntax (arbitrary
ASCII) and also defines a restricted well-behaved syntax (LDH host
names).  Having an anything-goes syntax means arbitrary protocols with
bizarre names can still use the DNS.  Having a restricted well-behaved
syntax means people can register names that are almost certain to work
in any protocols that use domain names, and protocols can use almost any
characters they like to delimit names in larger structures.
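
The restricted syntax can be sketched as follows (a minimal sketch of
the classic LDH rules: letters, digits, and hyphen, no leading or
trailing hyphen, labels of 1-63 octets; a complete check would also cap
the whole name at 255 octets):

```python
import re

# One LDH label: letters, digits, and hyphen only, 1-63 characters,
# with no leading or trailing hyphen.
LDH_LABEL = re.compile(r"^(?!-)[A-Za-z0-9-]{1,63}(?<!-)$")

def is_ldh_hostname(name: str) -> bool:
    """Return True if every label of `name` fits the restricted
    well-behaved LDH syntax; the DNS itself accepts far more."""
    labels = name.rstrip(".").split(".")
    return all(LDH_LABEL.match(label) for label in labels)
```

So `example.com` passes the restricted check, while a name like
`_tcp.example.com` (perfectly legal in the DNS itself, and used by SRV
records) fails it.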

Notice that the additional host syntax restrictions are usually not
enforced by applications as names are being passed around and compared.
They are usually enforced only when names are created.

> If #2 then why can't the document explicitly state that the
> intent is that no IETF standard restrict what can be used as an
> internationalized host name by prohibiting <list of code points>?
> Presumably <list of code points> could be a reference to a set of
> general categories in the Unicode standard.

That would be a good idea if we settled on #2.

> This would not prevent a future IETF standard from declaring that a
> larger set of code points can be used in internationalized host names.
> And it would not prevent a particular registry from having a policy
> that only allows a smaller set of code points.

Right.

> Avoiding downcasing of ASCII code points by ToASCII in an all-ASCII
> label seems a bit odd.
>
> The reasons I've seen seem to be that some applications would take
> the result of ToASCII and use it in protocol fields such as the
> From: in SMTP. The effect of having ToASCII downcase an all-ASCII
> label is that users will see different case characters when they
> upgrade to IDNA-aware applications even though they do not use an IDN
> themselves. Thus there might be some noticeable short-term effect
> depending on exactly how the IDNA-aware application is coded.

Yes, that is my main concern.  And also the general principle that
extending a standard should not alter the behavior of things that were
already covered in the old standard.

> The reason I'm asking is because when application protocols start
> carrying IDNs in a non-ACE encoding there will still be a need
> to nameprep those domain names in order for comparisons to work.
>
> When that day comes, comparisons of the prepared labels will still
> need to use case-insensitive matching for 00..7F code points, just
> because the preparation didn't downcase the ASCII code points.

The IDNA spec does not require applications to actually perform ToASCII
and a case-insensitive ASCII comparison when comparing labels.  It
merely requires applications to use a procedure that yields the same
answer.  One procedure that meets the requirement is to perform a
modified ToASCII that outputs everything in lower case, and then perform
an exact comparison.
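
A sketch of that optimization, using Python's built-in `idna` codec
(which implements IDNA2003; the sketch assumes RFC 3490 ToASCII
semantics, where all-ASCII labels pass through with their case
preserved):

```python
def labels_equal(x: str, y: str) -> bool:
    # ToASCII output is lowercase except for all-ASCII labels passed
    # through unchanged, so lowercasing the whole result is equivalent
    # to a modified ToASCII that outputs everything in lower case; an
    # exact comparison then matches IDNA's required case-insensitive
    # ASCII comparison.
    return x.encode("idna").lower() == y.encode("idna").lower()
```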

Would you like us to mention that optimization in the spec?  We have
generally decided not to describe optimizations in the spec, but rather
let the spec define correct externally-visible behavior and not concern
itself with internal implementation issues.

> Thus the current approach seems a bit short-sighted unless there is a
> good reason that I don't understand for not case folding ASCII-only
> labels.

I think maintaining the current behavior for existing names is a good
reason.

> Section 3 item 3 seems to implicitly prevent Unicode-aware
> applications from comparing two nameprepped strings without first
> converting them to ACE.  Why can't it be made explicit that, as a
> result of the design of the ACE algorithm being 1-1, two Unicode
> labels compare as equal if and only if the corresponding ACE labels
> compare as equal?

Correction:  Two non-ACE labels compare as equal if and only if the
corresponding ACE labels compare as equal.  It's a subtle but important
distinction.
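
The subtlety can be seen with Python's built-in IDNA2003 codec (a
sketch, assuming RFC 3490 ToASCII semantics): a Unicode label that
already looks like an ACE label is all-ASCII, so ToASCII passes it
through, and it collides with the label it encodes.

```python
# Two *different* Unicode labels...
ace_form    = "bücher".encode("idna")         # nameprep + punycode
passthrough = "xn--bcher-kva".encode("idna")  # all-ASCII: passed through

# ...whose ACE labels nonetheless compare equal, so "if and only if"
# holds only when both labels are known not to be ACE labels themselves.
assert ace_form == passthrough == b"xn--bcher-kva"
```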

> This would mean that an application, as long as it has applied
> ToUnicode to the strings received from an input method, the protocol,
> or some other source, could do a comparison without ACE conversion.
>
> And if case folding were applied to ASCII-only labels just like all
> other labels, then the resulting comparison would just need to be a
> code-point-by-code-point comparison.

Be careful.  Comparing ToUnicode(X) with ToUnicode(Y) is not a
valid comparison of X and Y, because the output of ToUnicode is not
necessarily nameprepped.  An exact comparison of Nameprep(ToUnicode(X))
and Nameprep(ToUnicode(Y)) is the same as a case-insensitive ASCII
comparison of ToASCII(X) and ToASCII(Y) when X and Y are valid labels.
But when X and Y are invalid, the former comparison might return a match
when the latter comparison would not (because ToASCII would fail).  If
you don't care about the result of comparing two invalid labels (and I
suppose you often wouldn't), then you could use the former comparison.
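
The former comparison can be sketched with Python's `encodings.idna`
module (assuming its `ToUnicode` and `nameprep` follow RFC 3490/3491;
like the procedure described above, this is only trustworthy when both
inputs are known to be valid labels):

```python
from encodings import idna

def labels_equal_unicode(x: str, y: str) -> bool:
    # Exact comparison of Nameprep(ToUnicode(label)).  Valid only when
    # x and y are valid labels; on invalid input its failure modes
    # differ from those of the ToASCII-based comparison.
    return idna.nameprep(idna.ToUnicode(x)) == idna.nameprep(idna.ToUnicode(y))
```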

Would you like this alternate comparison procedure described in the
spec?  Some might see it as unnecessary bloat.  I know a faster but more
complex method of performing ToUnicode, and I once suggested including
both versions in the spec, but we ultimately decided to keep the spec
short and leave optimizations to the implementors.

Maybe an informational RFC full of optimizations could be drafted after
the standard is out of the way.

AMC