
Re: [idn] length restrictions on IDN label



Soobok Lee <lsb@postel.co.kr> wrote:

> I have a punycode label of length 63 octets:
> L1: zq--o39AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>  
> L2 = ToUnicode(L1) produces U+AC00 x 56 (Hangul "KA" repeated 56 times).
> 
> But L2 can be encoded in various Unicode and legacy encodings, giving
> different lengths in octets:
> 
> UTF8 : 3 x 56 = 168 octets
> UCS2 : 2 x 56 = 112 octets
> UCS4 : 4 x 56 = 224 octets
> KSX1001/EUC-KR : 2 x 56 = 112 octets 
>  
> Many Internet applications impose or assume a 63-octet limit on
> label lengths.

IDN-unaware applications use this simple 63-octet limit.  These
applications also assume that the domain label is ASCII.  IDN-aware
applications will be careful to use the ASCII form when talking
to IDN-unaware applications.  Applications that use non-ASCII
representations will know the more complex syntax rule for non-ASCII
labels (namely, that the label is valid if and only if ToASCII can be
applied to it without failing).
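
As a rough illustration of that rule, here is a minimal Python sketch.
It assumes that the ToASCII function in Python's standard encodings.idna
module is an acceptable stand-in for the operation; that module follows
RFC 3490 and uses the "xn--" prefix rather than the "zq--" prefix in the
example above.  The name is_valid_label is hypothetical.

```python
from encodings.idna import ToASCII


def is_valid_label(label: str) -> bool:
    """Return True if ToASCII can be applied to the label without failing.

    Assumption: Python's ToASCII runs with UseSTD3ASCIIRules unset, so
    this sketch does not reject ASCII characters outside letters, digits
    and hyphen.
    """
    try:
        ToASCII(label)
    except UnicodeError:
        return False
    return True


ka = "\uac00"  # the Hangul syllable from the example above

print(is_valid_label(ka * 56))   # True: the ASCII form is exactly 63 octets
print(is_valid_label(ka * 57))   # False: the ASCII form would be 64 octets
print(len(ToASCII(ka * 56)))     # 63
```

Note that the check never looks at the length of the non-ASCII form;
only the result of ToASCII is measured.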

> From the implementors' point of view, a more precise specification is
> needed about whether an IDN label/FQDN has *NEW* length restrictions in
> various character encodings.

Section 2 defines "internationalized label" as a label to which the
ToASCII operation can be applied without failing.  There is no other
restriction on IDN label syntax.

> Implementors have a practical security-related need to impose some
> limits on IDN labels in non-ACE encodings (for example, to avoid
> buffer overflow errors due to expanded ToUnicode labels).

That's true.  A cursory examination of the Punycode algorithm reveals
that each ASCII character can represent at most one code point;
therefore an internationalized label can represent at most 63 code
points, whether it's ACE or not.  A given encoding uses a bounded number
of octets per code point, so you can allocate your buffers based on
that.
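
A minimal sketch of that buffer calculation, in Python.  The per-code-
point maxima in the table are my assumptions for the Unicode encoding
forms as currently defined (code points up to U+10FFFF); legacy
encodings would need their own figures.  The names MAX_CODE_POINTS and
max_buffer_octets are hypothetical.

```python
# Upper bound on code points in an internationalized label, per the
# argument above (one ASCII character encodes at most one code point).
MAX_CODE_POINTS = 63

# Assumed worst-case octets per code point for each encoding form.
MAX_OCTETS_PER_CODE_POINT = {
    "utf-8": 4,
    "utf-16-be": 4,   # a surrogate pair for code points above U+FFFF
    "utf-32-be": 4,
}


def max_buffer_octets(encoding: str) -> int:
    """Worst-case buffer size, in octets, for one label in this encoding."""
    return MAX_CODE_POINTS * MAX_OCTETS_PER_CODE_POINT[encoding]


label = "\uac00" * 56  # the 56-syllable label from the example above
for enc in MAX_OCTETS_PER_CODE_POINT:
    used = len(label.encode(enc))
    print(enc, used, max_buffer_octets(enc))
# utf-8 168 252
# utf-16-be 112 252
# utf-32-be 224 252
```

The bound is deliberately loose; it only has to be large enough that no
valid label can overflow the buffer.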

> The unit of the length restriction matters: number of code points or
> number of octets?  That should be made clearer.  RFC 1035 uses
> "octets", not characters or code points.

RFC 1035 limits domain labels to 63 octets, but RFC 1035 predates IDNA,
and it speaks under the explicit assumption that text is ASCII.  Because
DNS is IDN-unaware, all internationalized labels in DNS are in their
ASCII forms.  For these reasons, the 63-octet limit applies only to the
ASCII forms of internationalized labels.

IDNA does not introduce any new length restrictions.  The 63-octet limit
on ASCII labels is the only length restriction on internationalized
labels.

> Then is U+AC00 x 56 (from my previous posting) a valid label
> conforming to RFC 1035?

No, it's not, and that's why IDNA requires that it be converted to its
ASCII form before being passed into an IDN-unaware protocol like DNS.

> Are UTF-8-encoded IDN labels not governed by the RFC 1035 length
> restrictions?

Not directly.  The 63-octet limit applies to the ASCII form, not the
UTF-8 form.  It would be absurd to apply the 63-octet limit to every
possible encoding form.  You'd have to transcode a label into every
possible encoding just to check whether it's valid.

> Does IDNA contain brand-new length restrictions for 8-bit labels that
> obsolete RFC 1035?

No, it contains no new length restrictions.  The RFC 1035 restriction
on the ASCII form is still the only restriction on the length of
internationalized labels.

AMC