
Some comments on Hierarchy and Restoration draft



All - here are some comments from Steve Plote of Looking Glass Networks.
See comments in <<>> below. I also gave some comments on organization. Has
a new version been sent yet (or have I missed it)?

thanks,
---rob

------------------------------------------------------------------------
Rob, I have added some comments to the draft.  Not sure where
   you want to take this.  However, I know it will move to be in-depth.
   Please review the comments, and let me know if you have any questions
   or need clarifications.

   I know that you are on track with the hierarchy of restoration between
   layers and the nested timers to enable restoration at various layers,
   based upon failure to restore at the lower layers.
   WRT the prioritization of services for shared restoration bandwidth and
   breaking the network into regions with BGNs, that is pretty common and a
   good scheme as well.

   A lot of the implementation work for some of the BGN and interior node
   protocols is proprietary, and I am not sure if it can be shared with the
   IETF, as patents were applied for 2 years ago.   I have since left that
   company and they have not kept me abreast of the patent application
   progress.  However, I can check back with the patent attorneys and see
   if anything has happened.
   <<draft-team-tewg-restore-hierarchy-00.txt>>
   Regards,

   Steve Plote
   Director of Technology
   Looking Glass Networks
   15303 Dallas Parkway Suite 400
   Addison, TX   75001
   email: steve.plote@lglass.net
   (v) 972-759-9117
   (c) 214-502-6201





   Traffic Engineering Working Group                     Wai Sum Lai, AT&T
   Internet Draft                                   Dave McDysan, WorldCom
   <draft-team-tewg-restore-hierarchy-00.txt>                 (Co-Editors)
   Category: Informational
   Expiration Date: January 2002                                 Jim Boyle
                                                            Malin Carlzon
                                                      Rob Coltun, Redback
                                                        Tim Griffin, AT&T
                                                          Ed Kern, Cogent
                                                   Tom Reddington, Lucent

                                                                July 2001


               Network Hierarchy and Multilayer Survivability

   Status of this Memo

     This document is an Internet-Draft and is in full conformance with
     all provisions of Section 10 of RFC2026 [1].

     Internet-Drafts are working documents of the Internet Engineering
     Task Force (IETF), its areas, and its working groups. Note that
     other groups may also distribute working documents as Internet-
     Drafts. Internet-Drafts are draft documents valid for a maximum of
     six months and may be updated, replaced, or obsoleted by other
     documents at any time. It is inappropriate to use Internet-Drafts
     as reference material or to cite them other than as "work in
     progress."

     The list of current Internet-Drafts can be accessed at
     http://www.ietf.org/ietf/1id-abstracts.txt

     The list of Internet-Draft Shadow Directories can be accessed at
     http://www.ietf.org/shadow.html.


   1. Abstract

     This document is the deliverable out of the Network Hierarchy and
     Survivability Techniques Design Team established within the Traffic
     Engineering Working Group.  This team was requested to try to
     determine what the current and near term requirements are for
     survivability and hierarchy in MPLS networks.  The team determined
     that there appears to be a need for common, interoperable
     survivability approaches in packet and non-packet networks.
     Suggested approaches include path-based as well as one that repairs
     connections in proximity to the network fault.  For clarity, an
     expanded set of definitions is included.  As for hierarchy, there
     did not appear to be as much need for work on "vertical hierarchy,"
     defined as communication between network layers such as TDM/optical
     and MPLS.  In particular, instead of direct exchange of signaling
     and routing between vertical layers, some looser form of
     coordination and communication is a nearer term need.  For

   Lai, et al              Category - Expiration                     [1]
              Network Hierarchy and Multilayer Survivability   July 2001


     "horizontal hierarchy" in data networks, there does appear to be a
     pressing need.  This requirement is often presented in the context
     of layer 2 and layer 3 VPN services where SLAs would appear to
     necessitate signaling from the edges into the core of a network.
     Issues include potential limitations of current protocols in networks
     that are hierarchical (e.g., multi-area OSPF) and scalability
     concerns of potentially O(N^2) connection growth in larger networks.

                Please send comments to te-wg@ops.ietf.org


   2. Conventions used in this document

     The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
     "SHOULD", "SHOULD NOT", "RECOMMENDED",  "MAY", and "OPTIONAL" in
     this document are to be interpreted as described in RFC-2119 [2].


   3. Introduction

     This document presents a proposal of the tangible requirements for
     network survivability and hierarchy in current service provider
     environments.  With feedback from the working group solicited, the
     objective is to help focus the work that is being addressed in the
     traffic engineering, ccamp and other working groups.  A main goal of
     this work is to expedite the delivery of required functionality
     in multi-vendor service provider networks.  The initial focus is
     primarily on intra-domain operations.  However, to maintain
     consistency in the provision of end-to-end service in a multi-
     provider environment, rules governing the operations of
     survivability mechanisms at domain boundaries must also be
     specified.  While such issues are raised and discussed, where
     appropriate, they will not be treated in depth in the initial
     release of this document.

     The document first develops a set of definitions to be used later in
     this document and potentially in other documents as well.  It then
     addresses the requirements and issues associated with service
     restoration and hierarchy, and closes with a short discussion of
     survivability in a hierarchical context.


   4. Definitions

   4.1 Hierarchy Terminology


     Network hierarchy is an abstraction of part of a network's topology
     and the routing and signaling mechanism needed to support the
     topological abstraction.  Abstraction may be used as a mechanism to
     build large networks or as a technique for enforcing administrative,
     topological or geographic boundaries.  For example, network
     hierarchy might be used to separate the metropolitan and long-haul
     regions of a network or to separate the regional and backbone
     sections of a network [Bert Wijnen], or to interconnect service
     provider networks (with BGP which reduces a network to an Autonomous
     System).  In this document, network hierarchy is considered from two
     perspectives:
     (1) Horizontally oriented: between two areas or administrative
     subdivisions within the same network layer
     (2) Vertically oriented: between two network layers

     Horizontal hierarchy is the abstraction necessary to allow a network
     at one network layer, for instance a packet network, to grow.
     Examples of horizontal hierarchy include BGP and multi-area OSPF.

     Vertical hierarchy is the abstraction, or reduction in information,
     which would be of benefit when communicating information across
     network layers, as in propagating information between optical and
     router networks.


   4.2 Survivability Terminology


     Extra traffic is the traffic carried over the protection entity
     while the working entity is active.  Extra traffic is not protected,
     i.e., when the protection entity is required to protect the traffic
     that is being carried over the working entity (e.g., due to a
     failure affecting the working entity), the extra traffic is
     preempted.

     Normalization is the return to the normal state of a network upon
     completing the repair of the network failure.  This could include
     the rerouting of affected traffic to the original working entities
     or new routes.  The term revertive mode is used when traffic is
     returned to the working entity (switch back).

     Protection, also called protection switching, is a survivability
     technique based on predetermined failure recovery: as the working
     entity is established, resources are reserved for the protection
     entity.  These resources may be used by low-priority traffic
     (referred to as extra traffic) if traffic preemption is allowed.
     Depending on the amount of reserved resources, not all of the
     affected traffic may be protected.  (For further discussion of
     concepts related to protection, see the sub-section on
     Survivability Concepts below.)

     Protection entity (also called back-up entity or recovery entity) is
     the entity that is used to carry protected traffic in protection
     operation mode, i.e., when the working entity is in error or has
     failed.

     Recovery is the sequence of actions taken by a network after the
     detection of a failure to maintain the required performance level
     for existing services (e.g., according to service level agreements)
     and to allow normalization of the network.  The actions include
     notification of the failure followed by two parallel processes: (1)
     a repair process with fault isolation and repair of the failed
     components, and (2) a reconfiguration process with path selection
     and rerouting for the affected traffic.

     Rerouting is placement of affected traffic from the working entity
     to the protection entity, when the path for the protection entity
     has been selected after the detection of a fault on the working
     entity.  This is synonymous with switch-over in protection
     techniques.  (In [3], rerouting is synonymous with restoration.)

     Restoration is a survivability technique that dynamically discovers
     an alternate path from spare resources in the network, or establishes
     new paths on demand, for affected traffic once the failure is
     detected and the affected traffic is identified for rerouting.  The
     new path may be based on preplanned configurations or current
     network status.  Thus, restoration involves a path selection process
     followed by traffic rerouting. (In [3], restoration is referred to
     as recovery by rerouting.)

     Restoration, or more specifically, service restoration, refers to
     the actions taken by a network to maintain service continuity after
     the detection of a failure.  In this second usage, restoration has a
     meaning very similar to recovery, except that restoration covers
     only the reconfiguration process and not the repair process.  Also,
     in this usage, it should be clear from the context that it is
     irrelevant whether the survivability technique used to achieve
     service continuity is based on protection or restoration techniques.

     Restoration time is the time interval from the occurrence of a
     network impairment to the instant when either the affected traffic
     is completely rerouted, or spare resources are exhausted and/or no
     more preemptable traffic remains to make room.

     Revertive mode is a procedure in which revertive action, i.e.,
     switch back from the protection entity to the working entity, is
     taken once the failed working entity has been repaired.  In non-
     revertive mode, such action is not taken.  To minimize service
     interruption, switch-back in revertive mode should be performed at a
     time when there is the least impact on the traffic concerned, or by
     using the make-before-break concept.

     Shared risk group (SRG) is a set of network elements that are
     collectively impacted by a specific fault or fault type. For
     example, a shared risk link group (SRLG) is the union of all the
     links on those fibers that are routed in the same physical conduit
     in a fiber-span network.  This concept includes, besides shared
     conduit, other types of compromise such as shared fiber cable,
     shared right of way, shared optical ring, shared office without
     power sharing, etc.  The span of an SRG, such as the length of the
     sharing for compromised outside plant, needs to be considered on a
     per fault basis.
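
     As an illustration of how SRG information might be used when
     selecting failure-disjoint entities, the following minimal Python
     sketch checks whether two paths share any SRG.  The link names, the
     SRG identifiers, and the LINK_SRGS table are hypothetical and are
     not taken from any protocol specification.

```python
from typing import Dict, List, Set

# Hypothetical data: each link is tagged with the SRG identifiers it belongs to.
LINK_SRGS: Dict[str, Set[str]] = {
    "A-B": {"conduit-1"},
    "B-D": {"conduit-2"},
    "A-C": {"conduit-1", "row-7"},
    "C-D": {"conduit-3"},
    "A-E": {"conduit-4"},
    "E-D": {"conduit-5"},
}


def path_srgs(path: List[str]) -> Set[str]:
    """Return the union of SRG identifiers over all links in a path."""
    srgs: Set[str] = set()
    for link in path:
        srgs |= LINK_SRGS[link]
    return srgs


def shared_srgs(working: List[str], protection: List[str]) -> Set[str]:
    """Return the SRGs common to both paths; an empty set means SRG-disjoint."""
    return path_srgs(working) & path_srgs(protection)
```

     In this sketch, a working path over A-B and B-D and a candidate
     protection path over A-C and C-D are link-disjoint but still share
     "conduit-1", so a single conduit cut would affect both.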


     Survivability is the capability of a network to maintain service
     continuity in the presence of faults within the network [4].
     Survivability techniques such as protection and restoration are
     implemented either on a per-link basis, on a per-path basis, or
     throughout an entire network to alleviate service disruption at
     affordable costs.  The degree of survivability is determined by the
     network's capability to survive single failures, multiple failures,
     and equipment failures.

     Working entity is the entity that is used to carry traffic in normal
     operation mode.  Depending on the context, an entity can be, e.g., a
     channel or a transmission link in the physical layer, an LSP in
     MPLS, or a logical bundle of one or more LSPs.

   4.3 Survivability Concepts


     In a survivable network design, spare capacity and diversity must be
     built into the network from the beginning to support some degree of
     self-healing whenever failures occur.  A common strategy is to
     associate each working entity with a protection entity having either
     dedicated resources or shared resources that are pre-reserved or
     reserved-on-demand.  According to the methods of setting up a
     protection entity, different approaches to providing survivability
     can be classified.  Generally, protection techniques are based on
     having a dedicated protection entity set up prior to failure.  Such
     is not the case in restoration techniques, which mainly rely on the
     use of spare capacity in the network.  Hence, in terms of trade-
     offs, protection techniques usually offer fast recovery from failure
     with enhanced availability, while restoration techniques usually
     achieve better resource utilization.

     Protection techniques can be implemented by several architectures:
     1+1, 1:1, 1:n, and m:n.  In the context of SDH/SONET, they are
     referred to as Automatic Protection Switching (APS).

     In the 1+1 protection architecture, a protection entity is dedicated
     to each working entity.  The dual-feed mechanism is used whereby the
     working entity is permanently bridged onto the protection entity at
     the source of the protected domain.  In normal operation mode,
     identical traffic is transmitted simultaneously on both the working
     and protection entities.  At the sink of the protected domain, both
     feeds are monitored for alarms and maintenance signals.  A selection
     between the working and protection entity is made based on some
     predetermined criteria, such as the transmission performance
     requirements or defect indication.  This architecture is rather
     expensive since resource duplication is required.  It is generally
     used for specific services that need a very high availability.

     In the 1:1 protection architecture, a protection entity is also
     dedicated to each working entity.  The protected traffic is normally
     transmitted by the working entity.  If the working entity has
     failed, the protected traffic is rerouted to the protection entity.
     This architecture is inherently slower in recovering from failure
     than a 1+1 architecture since communication between both ends of the
     protection domain is required to perform the switch-over operation.
     An advantage is that the protection entity can optionally be used to
     carry preemptable "extra traffic" in normal operation.  Also, in
     packet networks, a protection path can be pre-established for later
     use with pre-planned but not pre-reserved capacity.  (If no packets
     are sent into a link, no bandwidth is consumed.)  This is not the
     case in channelized transport networks.

     In the 1:n protection architecture, a dedicated protection entity is
     shared by n working entities.  Traffic is normally sent on the
     working entities.  When multiple working entities have failed
     simultaneously, only one of them can be restored by the common
     protection entity.  This contention is resolved by assigning a
     different preemptive priority to each working entity.  As in the 1:1
     case, the protection entity can optionally be used to carry
     preemptable "extra traffic" in normal operation.

     The m:n architecture is a generalization of the 1:n architecture.
     Typically with m <= n, m dedicated protection entities are shared by n
     working entities.   While this architecture can improve system
     availability with small cost increases, it has rarely been
     implemented or standardized.
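
     The contention described above for the 1:n and m:n architectures
     can be sketched as follows.  This is only an illustrative Python
     sketch, not a description of any standardized APS behavior.  The
     function name and the convention that a lower number means a
     higher priority are assumptions made for the example.

```python
from typing import Dict, List, Tuple


def assign_protection(failed: Dict[str, int], m: int
                      ) -> Tuple[Dict[str, int], List[str]]:
    """Assign up to m protection entities to failed working entities.

    failed maps a working-entity name to its priority (lower number means
    higher priority; this convention is an assumption of the sketch).
    Returns (assignment, unprotected), where assignment maps a working
    entity to the index of the protection entity that carries its traffic.
    """
    # Highest priority first; ties are broken by name for determinism.
    ordered = sorted(failed, key=lambda name: (failed[name], name))
    assignment = {name: slot for slot, name in enumerate(ordered[:m])}
    unprotected = ordered[m:]
    return assignment, unprotected
```

     With m = 1 this reduces to the 1:n case: if three working entities
     fail at once, only the one with the highest priority is switched
     to the protection entity and the other two remain unprotected.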


   5. Survivability


   5.1 Scope

     Interoperable approaches to network survivability were determined to
     be an immediate requirement in packet networks as well as in
     SDH/SONET framed TDM networks.  Not as pressing at this time were
     techniques which would cover all-optical networks (e.g., where
     framing is unknown), as the control of these networks in a multi-
     vendor environment appeared to face other hurdles that must be dealt
     with first.  Also, not of immediate interest were approaches to coordinate
     or explicitly communicate survivability mechanisms across network
     layers (such as from a TDM or optical network to/from an IP
     network).  However, a capability should be provided for a network
     operator to control the operation of survivability mechanisms among
     different layers.  Such issues and those related to OAM are
     currently outside the scope of this document.  (For proposed MPLS
     OAM requirements, see [5].)

     The types of network failures that cause a restoration to be
     performed include link/span and node failures (which might include
     span failures at lower layers).  Other more complex failure
     mechanisms such as systematic control-plane failure or breach of
     security are not within the scope of the survivability mechanisms
     discussed in this document.



   5.2 Required initial set of survivability mechanisms


   5.2.1   1:1 Path Protection with Pre-Established Capacity

     In this protection mode, the head end of a working connection
     establishes a protection connection to the destination.  In normal
     operation, traffic is only sent on the working connection, though
     the ability to signal that traffic will be sent on both connections
     (1+1 Path for signaling purposes) would be valuable in non-packet
     networks.  Some distinction between working and protection
     connections is likely, either through explicit objects, or
     preferably through implicit methods such as general classes or
     priorities.  Head ends need the ability to create connections that
     are as failure disjoint as possible from each other.  This would
     require SRG information that can be generally assigned to either
     nodes or links and propagated through the control or management
     plane.  In this mechanism, capacity in the protection connection is
     pre-established; however, it can be used to carry preemptable extra
     traffic.  Protect capacity is assigned on a first-come, first-served
     basis.  When protect
     capacity is called into service during restoration, there should be
     the ability to promote the protection connection to working status
     (for non-revertive mode operation) with some form of make-before-
     break capability.
   <<Protect capacity should not be assigned on a first come first served basis but
   based upon the service priority groupings, i.e., mission-critical traffic gets first
   restoration priority.  Every other service type waits.  Then the 2nd-priority traffic
   gets restored, etc.  Obviously this does not apply for a 1+1 protection implementation.>>

   5.2.2   1:1 Path Protection with Pre-Planned Capacity

     Similar to the above 1:1 protection with pre-established capacity,
     the protection connection in this case is also pre-signaled.  The
     difference is in the way protect capacity is assigned.  With pre-
     planned capacity, the mechanism supports the ability for the protect
     capacity to be shared, or "double-booked."  Should operator-predicted
     failures occur (their prediction could rely on enumeration of
     SRGs), it is expected that only a limited set of protect
     connections would be put into service, and that the protect
     capacity available in the network would be able to carry this
     traffic (given proper sizing and planning of the network).  In a
     sense, this is 1:1 from a path perspective; however, the protect
     capacity in the network (on a link-by-link basis) is shared in a 1:n
     fashion.  Some form of information propagation could be required
     before traffic may be sent on protection connections, especially in
     TDM networks.  In data networks, a desirable operating approach for
     this mechanism might be where the protect capacity is not accurately
     booked against SRGs (e.g. non-predictive).

     The use of this approach improves network resource utilization, but
     may require more careful planning.  So, initial deployment might be
     based on 1:1 path protection with pre-established capacity and the
     local restoration mechanism to be described next.
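
     To make the difference between pre-established and pre-planned
     ("double-booked") protect capacity concrete, the following Python
     sketch sizes the protect capacity on each link for the worst single
     SRG failure and compares it with the unshared total.  It is a
     planning illustration only.  The Connection record, its field
     names, and the single-SRG-failure assumption are choices made for
     this example and are not part of the draft.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Set


class Connection(NamedTuple):
    name: str
    bandwidth: int            # arbitrary units, e.g. Mb/s
    working_srgs: Set[str]    # SRGs traversed by the working path
    protect_links: List[str]  # links used by the pre-signaled protect path


def shared_protect_capacity(conns: List[Connection]) -> Dict[str, int]:
    """Per-link protect capacity needed to survive any single SRG failure."""
    # demand[link][srg] = bandwidth switched onto link if that SRG fails
    demand: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for conn in conns:
        for srg in conn.working_srgs:
            for link in conn.protect_links:
                demand[link][srg] += conn.bandwidth
    # Size each link for its worst single-SRG failure, not for the sum.
    return {link: max(per_srg.values()) for link, per_srg in demand.items()}


def dedicated_protect_capacity(conns: List[Connection]) -> Dict[str, int]:
    """Per-link protect capacity if nothing is shared (pre-established case)."""
    total: Dict[str, int] = defaultdict(int)
    for conn in conns:
        for link in conn.protect_links:
            total[link] += conn.bandwidth
    return dict(total)
```

     For example, two 10-unit connections whose working paths lie in
     different SRGs but whose protect paths share one link need only 10
     units of protect capacity on that link under sharing, against 20
     units when capacity is dedicated.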

   5.2.3   Local Restoration

     Due to the time impact of signal propagation, path-based approaches
     may not be able to meet the service requirements desired in some
     networks.   The solution to this is to restore connectivity in
     immediate proximity to the fault.  At a minimum, this approach
     should be able to protect against connectivity-type SRGs, though
     protecting against node-based SRGs might be worthwhile.  After local
     restoration is in place, it is likely that head end systems would
     later perform some path-level re-grooming.  Head end systems must
     have some control as to whether their connections are candidates for
     or excluded from local restoration.
   << An easy way to make sure connections are excluded from local restoration is
   to make them best effort and pre-emptible.  That way they only get restored
   if there is bandwidth available. >>

   5.2.4   Path Restoration

     In this approach, connections that are impacted by a fault are
     rerouted by the originating network element upon notification of
     connection failure.  This approach does not involve any new
     mechanisms.  It is mentioned merely as another common approach to
     protecting against faults in a network.
   << Source-based routing is most efficient for network resources, but typically has
   longer restoration times.>>

   5.3 Applications Supported

     With service continuity under failure as a goal, a network is
     "survivable" if, in the face of a network failure, connectivity is
     interrupted for a brief period and then restored before the network
     failure ends.  The length of this interrupted period is dependent on
     the application supported.  Here are some typical applications that
     need to be considered:

     - Best-effort data: restoration of network connectivity by rerouting
       at the IP layer would be sufficient
     - Premium data service: need to meet TCP or application protocol
       timer requirements
     - Voice: call cutoff is in the range of 140 msec to 2 sec
     - Other real-time service (e.g., streaming, fax)
     - Mission-critical applications
   << Mission Critical applications should only be supported with 1+1 or 1:1
   and local restoration schemes.  Voice has been presumed to restore within 50ms.>>

   5.4 Timing Bounds for Service Restoration

     The approach to picking the types of survivability mechanisms
     recommended was to consider a spectrum of mechanisms that can be
     used to protect traffic with varying characteristics of
     survivability and speed of restoration, and then attempt to select a
     few general points which provide some coverage across that spectrum.
     The focus of this work is to provide requirements to which a small
     set of detailed proposals may be developed, allowing the operator
     some (limited) flexibility in approaches to meeting their design
     goals in engineering multi-vendor networks.  Requirements of
     different applications as listed in the previous sub-section were
     discussed generally; however, none on the team would likely attest to
     the scientific merit of the ability of the timing bounds below to
     meet any specific application's needs.  A few assumptions include:

     - Approaches that perform protection switching without propagation
       of information are likely to be faster than those that require
       some form of fault notification to some or all elements in a
       network.
     - Approaches that require some form of signaling after a fault will
       also likely suffer some timing impact.

     Proposed timing bounds for service restoration for different
     mechanisms are as follows (all bounds are exclusive of signal
     propagation):

     1:1 path protection with pre-established capacity:   100-500 ms
     1:1 path protection with pre-planned capacity:       100-750 ms
     Local restoration:                                   50 ms
     Path restoration:                                    1-5 seconds

     To ensure that the service requirements for different applications
     can be met within the above timing bounds, restoration priority is
     used to determine the order in which connections are restored (to
     minimize service restoration time as well as to gain access to
     available spare capacity).  For example, mission critical
     applications may require high restoration priority.  Preemption
     priority should only be used in the event that all connections
     cannot be restored, in which case connections with lower preemption
     priority should be released.  Depending on a service provider's
     strategy in provisioning network resources for backup, preemption
     may or may not be needed in the network.

   <<Your 1:1 path protection times assume that there is no local restoration,
   either at a ring level or a 1+1 or 1:1 link level.  These typically require signaling back
   to the source node to begin the restoration process.>>
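
     The interplay of restoration priority and preemption priority
     described in Section 5.4 can be sketched in Python as follows.
     This is a minimal illustration under stated assumptions: a single
     pool of spare capacity, lower numbers meaning higher priority, and
     preemption applied only when spare capacity alone is insufficient.
     The Conn record and the restore function are hypothetical names
     introduced for the example.

```python
from typing import List, NamedTuple, Tuple


class Conn(NamedTuple):
    name: str
    bandwidth: int
    restoration_priority: int  # lower number is restored first (assumed)
    preemption_priority: int   # lower number is released last (assumed)


def restore(failed: List[Conn], spare: int, preemptable: List[Conn]
            ) -> Tuple[List[str], List[str], List[str]]:
    """Return (restored, preempted, unrestored) connection names."""
    restored: List[str] = []
    preempted: List[str] = []
    unrestored: List[str] = []
    # Candidates for release, lowest preemption priority (largest number) first.
    victims = sorted(preemptable, key=lambda c: -c.preemption_priority)
    for conn in sorted(failed, key=lambda c: c.restoration_priority):
        # Only traffic with lower preemption priority may be released.
        eligible = [v for v in victims
                    if v.preemption_priority > conn.preemption_priority]
        chosen: List[Conn] = []
        available = spare
        for victim in eligible:
            if available >= conn.bandwidth:
                break  # spare capacity suffices; no further preemption
            available += victim.bandwidth
            chosen.append(victim)
        if available >= conn.bandwidth:
            for victim in chosen:
                victims.remove(victim)
                preempted.append(victim.name)
            spare = available - conn.bandwidth
            restored.append(conn.name)
        else:
            # Nothing is preempted if the connection still cannot fit.
            unrestored.append(conn.name)
    return restored, preempted, unrestored
```

     Connections are served strictly in restoration-priority order, and
     no traffic is released on behalf of a connection that cannot be
     restored even after preemption.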

   5.5 Coordination Among Layers

     A common design goal for multi-layered networks is to provide the
     desired level of service in the most cost-effective manner.  The use
     of multilayer survivability might allow the optimization of spare
     resources through the improvement of resource utilization by sharing
     spare capacity across different layers, though further
     investigations are needed.  Coordination during service restoration
     among different network layers (e.g. IP, SDH/SONET, optical layer)
     might necessitate development of vertical hierarchy.  The benefits
     of providing survivability mechanisms at multiple layers, and the
     optimization of the overall approach, must be weighed with the
     associated cost and service impacts.

     A default coordination mechanism for inter-layer interaction could
     be the use of nested timers and current SDH/SONET fault monitoring,
     as has been done traditionally for backward compatibility.  Thus,
     when lower-layer restoration happens in a longer time period than
     higher-layer restoration, a hold-off timer is utilized to avoid
     contention between the different single-layer recovery schemes.  In
     other words, multilayer interaction is addressed by having
     successively higher multiplexing levels operate at a restoration time
     scale greater than that of the next lower layer.  Currently, if SDH/SONET
     protection switching is used, MPLS recovery timers must wait until
     SDH/SONET has had time to switch.
   << Yes we agree with the need for nested timers to enact restoration schemes for
   each layer of the hierarchy.  Do you have any proposed timer delays?  SONET/SDH can
   be assumed to only restore at the local level, not the network level.>>
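
     The nested-timer idea can be sketched in Python as follows: each
     layer holds off for the cumulative worst-case recovery time of the
     layers beneath it.  The function name, the optional guard interval,
     and any numeric values supplied by a caller are assumptions made
     for illustration.  They are placeholders, not timer values proposed
     by this document.

```python
from typing import Dict, List, Tuple


def hold_off_timers(layers: List[Tuple[str, float]], guard_ms: float = 0.0
                    ) -> Dict[str, float]:
    """Compute a hold-off timer (ms) for each layer, lowest layer first.

    layers is a bottom-up list of (layer name, worst-case time in ms that
    the layer needs to complete its own recovery once it starts acting).
    Each layer waits until every layer below it has had time to finish,
    plus an optional guard interval per lower layer.
    """
    timers: Dict[str, float] = {}
    elapsed = 0.0
    for name, recovery_ms in layers:
        timers[name] = elapsed
        elapsed += recovery_ms + guard_ms
    return timers
```

     With placeholder inputs of 50 ms for an SDH/SONET layer and 500 ms
     for an MPLS layer, the lowest layer acts immediately, MPLS holds
     off for 50 ms, and a layer above MPLS would hold off for 550 ms.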


     It was felt that the current approaches to coordinating
     survivability mechanisms did not have significant
     operational shortfalls.  These approaches include protecting traffic
     solely at one layer (e.g. at the IP layer over linear WDM, or at the
     SDH/SONET layer).  Where survivability mechanisms might be deployed
     at several layers, such as when a routed network rides an SDH/SONET
     protected network, it was felt that current coordination approaches
     were sufficient in many cases.  One exception is the hold-off of
     MPLS recovery until the completion of SDH/SONET protection switching
     as described above.  This limits the recovery time of fast MPLS
     restoration.  Also, note that failures within a layer can be guarded
     against by techniques either in that layer or at a higher layer, but
     not in reverse.  Thus, the optical layer cannot guard against
     failures in the IP layer such as router system failures or line card
     failures.

   << The physical layer does protect against line card failures.  It
     should be able to switch/restore prior to layer 3 restoration schemes.>>

   5.6 Evolution Toward IP Over Optical

     As more pressing requirements for survivability and horizontal
     hierarchy for edge-to-edge signaling are met with technical
     proposals, it is believed that the benefits of merging (in some
     manner) the control planes of multiple layers will be outlined.
     When these benefits are self-evident, it would then seem to be the
     right time to review if vertical hierarchy mechanisms are needed,
     and what the requirements might be.

   6. Hierarchy Requirements

     Efforts in the area of network hierarchy should focus on mechanisms
     that would allow more scalable edge-to-edge signaling, or signaling
     across networks with existing network hierarchy (such as multi-area
     OSPF).  This would appear to be a more immediate need than
     mechanisms that might be needed to interconnect networks at
     different layers.

   6.1 Historical Context

     One reason for horizontal hierarchy is functionality (e.g., metro
     versus backbone).  Geographic islands reduce the need for
     interoperability and make administration and operations less
     complex.  Using a simpler, more interoperable, survivability scheme
     at metro/backbone boundaries is natural for many provider network
     architectures.  In transmission networks, creating geographic
     islands of different vendor equipment has been done for a long time
     because multi-vendor interoperability has been difficult to achieve.
     Traditionally, providers have to coordinate the equipment on either
     end of a "connection," and making this interoperable reduces
     complexity.  A provider should be able to concatenate survivability
     mechanisms in order to provide a "protected link" to the next higher
     level.  Think of SDH/SONET rings connecting to TDM DXCs with 1+1
     line-layer protection between the ADM and the DXC port.  The TDM
     connection, e.g., a DS3, is protected, but usually all equipment on
     each SDH/SONET ring is from a single vendor.  The DXC cross
     connections are controlled by the provider and the ports are
     physically protected resulting in a highly available design.  Thus,
     concatenation of survivability approaches can be used to cascade
     across horizontal hierarchy.  While not perfect, it is workable in
     the near- to mid-term until multi-vendor interoperability is
     achieved.

     While the problems associated with multi-vendor interoperability may
     necessitate horizontal hierarchy as a practical matter (at least
     this has been the case in TDM networks), there may be no technical
     reason for it.  Members of the team with more experience on IP
     networks felt there should be no need for this in core networks, or
     even most access networks.

     Some of the largest service provider networks currently run a single
     area/level IGP.  Some service providers, as well as many large
     enterprise networks, run multi-area OSPF to gain increases in
     scalability.  Often, this was from an original design, so it is
     difficult to say if the network truly required the hierarchy to
     reach its current size.
   << This allows local/regional restoration schemes as opposed to network level restoration
   schemes.  More likely to converge but less likely to be bandwidth efficient.  Then there
   are the boundary gateway nodes that have to deal with the local restoration protocol and
   some level of hierarchical restoration protocol to enable network level restoration.>>

     Some proposals on improved mechanisms to address network hierarchy
     have been suggested [6, 7, 8].  This document aims to provide the
     concrete requirements so that these and other proposals can first
     aim to meet some limited objectives.

   6.2 Applications for Horizontal Hierarchy

     A primary driver for intra-domain horizontal hierarchy is signaling
     scalability in the context of edge-to-edge VPNs, potentially across
     traffic-engineered data networks.  There are a number of different
     approaches to VPNs and they are currently being addressed by
     different emerging protocols: RFC 2547bis BGP/MPLS VPNs, provider-
     provisioned VPNs based upon MPLS tunnels (e.g., virtual routers),
     Pseudo Wire Emulation Edge-to-Edge (PWE3), etc.  These may or may
     not need explicit signaling from edge to edge, but it is a common
     perception that in order to meet SLAs, some form of edge-to-edge
     signaling is required.
   << This expresses what I stated above in 6.1.  One common signalling layer would
   reduce the complexity of the restoration task if not handled by local schemes.>>

     For signaling scalability, there are probably two types of network
     scenarios to consider:

     - Large SP networks with flat routing domains where edge-to-edge
       (MPLS) signaling as implemented today would probably not scale.
     - Networks which would like to signal edge-to-edge, and might even
       scale in a limited application. However, they are hierarchically
       routed (e.g. OSPF areas), and current implementations, and
       potentially standards, prevent signaling across areas.  This
       requires the development of signaling standards that support
       dynamic establishment and potentially restoration of LSPs across a
       2-level IGP hierarchy.

     Scalability is concerned with the O(N^2) properties of edge-to-edge
     signaling.  For a large network, maintaining a "connection" between
     every edge is simply not scalable.  Even if establishing and
     maintaining connections is feasible, there might be an impact on
     core survivability mechanisms which would cause restoration times to
     grow with N^2, which would be undesirable.  While some value of N
     may be inevitable, approaches to reduce N (e.g. to pull in from the
     edge to aggregation points) might be of value.
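
     The O(N^2) growth, and the effect of pulling signaling in from the
     edge to aggregation points, can be illustrated with a small count.
     This is a sketch under stated assumptions, not a design from the
     draft: the function names are hypothetical, LSPs are counted as
     unidirectional, and each edge node is assumed to home on exactly
     one aggregation point.

```python
def full_mesh_lsps(n: int) -> int:
    """Unidirectional LSPs needed to connect every ordered pair of n nodes."""
    return n * (n - 1)


def aggregated_lsps(n_edges: int, n_agg: int) -> int:
    """LSP count when edges attach to aggregation points (hypothetical design).

    Each edge node has one LSP in each direction to its aggregation
    point, and only the aggregation points are fully meshed.
    """
    access = 2 * n_edges          # edge <-> aggregation point, both directions
    core = full_mesh_lsps(n_agg)  # full mesh among aggregation points only
    return access + core
```

     For 100 edge nodes, the full mesh needs 9900 LSPs in this model,
     while homing the same edges on 10 aggregation points needs 290.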

     For routing scalability, especially in data applications, a major
     concern is the amount of processing/state that is required in the
     variety of network elements.  If some nodes might not be able to
     communicate and process the state of every other node, it might be
     preferable to limit the information.  One school of thought holds
     that the amount of information contained by a horizontal barrier
     should be significant, and that the impact this might have on
     optimality in route selection and on the ability to provide global
     survivability is an accepted tradeoff.
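
     As a rough illustration of the state argument, the sketch below
     compares per-node topology state in a flat domain with that in a
     two-level hierarchy.  It is a deliberately crude model and not a
     description of OSPF or IS-IS behavior: it assumes equal-sized
     areas and a single summary entry per remote area, and the function
     names are hypothetical.

```python
def flat_entries(n_nodes: int) -> int:
    """Topology entries held by each node when every node sees every node."""
    return n_nodes


def hierarchical_entries(n_nodes: int, n_areas: int) -> int:
    """Entries held by an interior node in an equal-area hierarchy.

    Assumes full detail for the node's own area plus one summary
    entry for each other area (a simplifying assumption).
    """
    per_area = -(-n_nodes // n_areas)  # ceiling division
    return per_area + (n_areas - 1)
```

     Under these assumptions, a 1000-node network split into 10 areas
     leaves each interior node with 109 entries instead of 1000, at the
     cost of the reduced visibility discussed above.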

   6.3 Horizontal Hierarchy Requirements

     Mechanisms are required to allow for edge-to-edge signaling of
     connections through a network.  The types of network scenarios
     include large networks with a large number of edge devices and flat
     interior routing, as well as medium to large networks which
     currently have hierarchical interior routing such as multi-area OSPF
     or multi-level IS-IS.  The primary context of this is edge-to-edge
     signaling which is thought to be required to assure the SLAs for the
     layer 2 and layer 3 VPNs that are being carried across the network.
     Another possible context would be edge-to-edge signaling in TDM
     SDH/SONET networks, where metro and core networks again might either
     be in a flat or hierarchical interior routing domain.

   << This capability of SDH/SONET networks providing edge-to-edge signaling will
   not happen with the OSI 7 layer stack model elements.  This may only be feasible
   for the IP over DCC managed elements; if we want to perform the signaling in-band.
   There are all kinds of out of band schemes to provide this edge-to-edge connectivity.
   The problem is having an out of band network to support this in a very large network.
   This gets back to the DCC interoperability being the key for TDM networks to work at
   the signaling level.  We have always assumed that MPLS will be the way to get layer2/3
   VPN data networks to signal/interoperate.>>

   7. Survivability and Hierarchy

     When horizontal hierarchy exists in a network layer, a question
     arises as to how survivability can be provided along a connection
     which crosses hierarchical boundaries.

     In designing protocols to meet the requirements of hierarchy, an
     approach to consider is that boundaries are either clean, or are of
     minimal value.  However, the concept of network elements that
     participate on both sides of a boundary might be a consideration
     (e.g. OSPF ABRs).  That would allow for devices on either side to
     take an intra-area approach within their region of knowledge, and
     for the ABR to do this in both areas, and splice the two protected
     connections together at a common point (granted it is a common point
     of failure now).  If the limitations of this approach start to
     appear in operational settings, then perhaps it would be time to
     start thinking about route-servers and signaling propagated
     directives.  However, one initial approach might be to signal
     through a common border router, and to consider the service as
     protected as it consists of a concatenated set of connections which
     are each protected within their area.  Another approach might be to
     have a least common denominator mechanism at the boundary, e.g., 1+1
     port protection.  There should also be some standardized means for a
     survivability scheme on one side of such a boundary to communicate
     with the scheme on the other side regarding the success or failure
     of the service restoration action.  For example, if a part of a
     "connection" is down on one side of such a boundary, there is no
     need for the other side to recover from failures.

   << I have worked on these issues in a single vendor network with different Hierarchies
   of signaling between Boundary Gateway Nodes and Interior Nodes (BGN-BGN and BGN-IN) for
   optical equipment.  The concept of running a different mesh restoration scheme at the
   BGN level based upon 1+1 was a costly but effective method to provide end to end
   restoration times that were in the 100s of milliseconds.  If there was a failure between BGNs
   it was protected 1+1; if the failure was within a boundary, the local nodes attempted to
   restore a path back out to the BGN and to the rest of the network.  There are lots of options
   here; just none universal.  I guess that is why we are trying to standardize a process.>>
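
     The concatenation approach in this section, including the need for
     each side of a boundary to learn whether restoration on the other
     side succeeded, can be sketched as follows.  This is a minimal
     model under assumptions: the Segment and SegmentState types and
     both function names are hypothetical, and no signaling protocol is
     implied.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence


class SegmentState(Enum):
    UP = "up"              # working path in service
    RESTORED = "restored"  # failure occurred and the local scheme recovered
    DOWN = "down"          # failure occurred and local restoration failed


@dataclass
class Segment:
    """One per-area protected connection, spliced to its neighbors at a border node."""
    area: str
    state: SegmentState = SegmentState.UP


def end_to_end_up(segments: Sequence[Segment]) -> bool:
    """The concatenated service is up only if no segment is DOWN."""
    return all(s.state is not SegmentState.DOWN for s in segments)


def should_attempt_restoration(segments: Sequence[Segment], failed_index: int) -> bool:
    """Skip local recovery if another segment has already reported DOWN.

    This mirrors the observation that there is no need for one side of
    a boundary to recover when the connection is down on the other side.
    """
    return not any(
        s.state is SegmentState.DOWN
        for i, s in enumerate(segments)
        if i != failed_index
    )
```

     For a connection crossing three areas, marking the first segment
     DOWN makes end_to_end_up return False, and a later failure in the
     third segment would not trigger local restoration there.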


     In summary, at this time, approaches that allow concatenation of
     survivability schemes across hierarchical boundaries should prove
     sufficient.


   8. Security Considerations

     Security is not considered in this initial version.


   9. References


     1  Bradner, S., "The Internet Standards Process -- Revision 3", BCP
        9, RFC 2026, October 1996.

     2  Bradner, S., "Key words for use in RFCs to Indicate Requirement
        Levels", BCP 14, RFC 2119, March 1997.

     3  V. Sharma, B. Crane, K. Owens, C. Huang, F. Hellstrand, J. Weil,
        L. Andersson, B. Jamoussi, B. Cain, S. Civanlar, and A. Chiu,
        "Framework for MPLS-based Recovery," Internet-Draft, Work in
        Progress, March 2001.

     4  D.O. Awduche, A. Chiu, A. Elwalid, I. Widjaja, and X. Xiao, "A
        Framework for Internet Traffic Engineering," Internet-Draft, Work
        in Progress, May 2001.

     5  N. Harrison, et al, "Requirements for OAM in MPLS Networks,"
        Internet-Draft, Work in Progress, May 2001.

     6  K. Kompella and Y. Rekhter, "Multi-area MPLS Traffic
        Engineering," Internet-Draft, Work in Progress, March 2001.

     7  G. Ash, et al, "Requirements for Multi-Area TE," Internet-Draft,
        Work in Progress, March 2001.

     8  A. Iwata, N. Fujita, G.R. Ash, and A. Farrel, "Crankback Routing
        Extensions for MPLS Signaling," Internet-Draft, Work in Progress,
        July 2001.


   10.  Acknowledgments

     A lot of the direction taken in this document, and by the team, was
     steered by the insightful questions provided by Bala Rajagopalan,
     Greg Bernstein, Yangguang Xu, and Avri Doria.  The set of questions
     is attached as Appendix A in this document.


   11. Authors' Addresses

     Wai Sum Lai
     AT&T
     200 Laurel Avenue
     Middletown, NJ 07748, USA
     Tel: +1 732-420-3712
     wlai@att.com

     Dave McDysan
     WorldCom
     22001 Loudoun County Pkwy
     Ashburn, VA 20147, USA
     dave.mcdysan@wcom.com

     Jim Boyle
     jimpb@nc.rr.com

     Malin Carlzon
     malin@sunet.se

     Rob Coltun
     rcoltun@redback.com

     Tim Griffin
     AT&T
     180 Park Avenue
     Florham Park, NJ 07932, USA
     Tel: +1 973-360-7238
     griffin@research.att.com

     Ed Kern
     Cogent Communications
     3413 Metzerott Rd
     College Park, MD 20740, USA
     Tel: +1 703-852-0522
     ejk@tech.org

     Tom Reddington
     Lucent Technologies
     67 Whippany Rd
     Whippany, NJ 07981, USA
     Tel: +1 973-386-7291
     treddington@bell-labs.com


   Appendix A: Questions used to help develop requirements

     A. Definitions

     1. In determining the specific requirements, the design team should
     precisely define  the concepts "survivability", "restoration",
     "protection", "protection switching", "recovery", "re-routing" etc.
     and their relations. This would enable the requirements doc to
     describe precisely which of these will be addressed.
     In the following, the term "restoration" is used to indicate the
     broad set of policies and mechanisms used to ensure survivability.

     B. Network types and protection modes

     1. What is the scope of the requirements with regard to the types of
     networks covered? Specifically, are the following in scope:

     Restoration of connections in mesh optical networks (opaque or
     transparent)
     Restoration of connections in hybrid mesh-ring networks
     Restoration of LSPs in MPLS networks (composed of LSRs overlaid on a
     transport network, e.g., optical)
     Any other types of networks?
     Is commonality of approach, or optimization of approach more
     important?

     2.  What are the requirements with regard to the protection modes to
     be supported in each network type covered? (Examples of protection
     modes include 1+1, M:N, shared mesh, UPSR, BLSR, newly defined modes
     such as P-cycles, etc.)

     3.  What are the requirements on local span (i.e., link by link)
     protection and end-to-end protection, and the interaction between
     them?  E.g., what should be the granularity of connections for each
     type (single connection, bundle of connections, etc.)?

     C. Hierarchy

     1. Vertical (between two network layers):
         What are the requirements for the interaction between
     restoration procedures across two network layers, when these
     features are offered in both layers?  (Example: an MPLS network
     realized over point-to-point optical connections.)  In such a case,

         (a) Are there any criteria to choose which layer should provide
     protection?

         (b) If both layers provide survivability features, what are the
     requirements to coordinate these mechanisms?

         (c) How is the current lack of cross-layer coordination
     functionality hampering operations?

         (d) Would the benefits be worth additional complexity associated
     with routing isolation (e.g. VPN, areas), security, address
     isolation and policy / authentication processes?

     2. Horizontal (between two areas or administrative subdivisions
     within the same network layer):

         (a) What are the criteria that trigger the creation of protocol
     or administrative boundaries pertaining to restoration (e.g.,
     scalability, multi-vendor interoperability, multi-provider
     operation)?  What are the practical issues?  Should multi-vendor
     operation necessitate hierarchical separation?

         When such boundaries are defined:

         (b) What are the requirements on how protection/restoration is
     performed end-to-end across such boundaries?

         (c) If different restoration mechanisms are implemented on two
     sides of a boundary, what are the requirements on their interaction?

         What is the primary driver of horizontal hierarchy? (select one)
         - functionality (e.g. metro vs. backbone)
         - routing scalability
         - signaling scalability
         - current network architecture, trying to layer TE on top of an
           already hierarchical network architecture
         - routing and signaling

         For signaling scalability, is it
         - manageability
         - processing/state of network
         - edge-to-edge N^2 type issue

         For routing scalability, is it
         - processing/state of network
         - are you flat and want to go hierarchical
         - or already hierarchical?
         - data or TDM application?

     D. Policy

     1. What are the requirements for policy support during
     protection/restoration (e.g., restoration priority, preemption,
     etc.)?

     E. Signaling Mechanisms

     1. What are the requirements on the signaling transport mechanism
     (e.g., in-band over SDH/SONET overhead bytes, out-of-band over an IP
     network, etc.) used to communicate restoration protocol messages
     between network elements?  What are the bandwidth and other
     requirements on the signaling channels?

     2. What are the requirements on fault detection/localization
     mechanisms (which is the prelude to performing restoration
     procedures)  in the case of opaque and transparent optical networks?
     What are the requirements in the case of MPLS restoration?

     3. What are the requirements on signaling protocols to be used in
     restoration procedures (e.g., high priority processing, security,
     etc.)?

     4. Are there any requirements on the operation of restoration
     protocols?

     F. Quantitative

     1. What are the quantitative requirements (e.g., latency) for
     completing restoration under different protection modes (for both
     local and end-to-end protection)?

     G. Management

     1. What information should be measured/maintained by the control
     plane at each network element pertaining to restoration events?

     2. What are the requirements for the correlation between control
     plane and data plane failures from the restoration point of view?


   Full Copyright Statement

     "Copyright (C) The Internet Society (date). All Rights Reserved.
     This document and translations of it may be copied and furnished to
     others, and derivative works that comment on or otherwise explain it
     or assist in its implementation may be prepared, copied, published
     and distributed, in whole or in part, without restriction of any
     kind, provided that the above copyright notice and this paragraph
     are included on all such copies and derivative works. However, this
     document itself may not be modified in any way, such as by removing
     the copyright notice or references to the Internet Society or other
     Internet organizations, except as needed for the purpose of
     developing Internet standards in which case the procedures for
     copyrights defined in the Internet Standards process must be
     followed, or as required to translate it into languages other than
     English.

     The limited permissions granted above are perpetual and will not be
     revoked by the Internet Society or its successors or assigns.

     This document and the information contained herein is provided on an
     "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
     TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
     BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
     HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
     MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.