
RE: document coming around one more time



Team,
   Here's the revised version promised below.
Thanks, Wai Sum.

-----Original Message-----
From: Ed Kern [mailto:ejk@tech.org]
Sent: Wednesday, July 11, 2001 1:40 PM
To: tewg-dt@ops.ietf.org
Subject: document coming around one more time


team,

the design team document that waisum pushed around this morning was
discussed on the call today, we have every intention of getting this
document to the editor by friday.  To do this waisum will be making his
last minute changes (and possibly putting it into text form) by this
evening and sending it to the list.  Please read and comment (OR send
waisum a note that you read it and have no comment) by cob thursday.

thanks,

Ed




Traffic Engineering Working Group                     Wai Sum Lai, AT&T
Internet Draft                                   Dave McDysan, WorldCom
Document: <draft-team-tewg-survivability-00.txt>            (Co-Editors)
Category: Informational                                       Jim Boyle
                                                          Malin Carlzon
                                                    Rob Coltun, Redback
                                                      Tim Griffin, AT&T
                                                        Ed Kern, Cogent
                                                 Tom Reddington, Lucent

                                                              July 2001


             Network Hierarchy and Multilayer Survivability

Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026 [1].

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as Internet-
   Drafts. Internet-Drafts are draft documents valid for a maximum of
   six months and may be updated, replaced, or obsoleted by other
   documents at any time. It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.


1. Abstract

   This document is the deliverable out of the Network Hierarchy and
   Survivability Techniques Design Team established within the Traffic
   Engineering Working Group.  This team was requested to try to
   determine what the current and near term requirements are for
   survivability and hierarchy in MPLS networks.  The team determined
   that there appears to be a need for common, interoperable
   survivability approaches in packet and non-packet networks.
   Suggested approaches include path-based techniques as well as
   techniques that repair connections in proximity to the network
   fault.  For clarity, an
   expanded set of definitions is included.  As for hierarchy, there
   did not appear to be as much need for work on “vertical hierarchy”,
   defined as communication between network layers such as TDM/optical
   and IP.  However, there does appear to be a pressing need for
   “horizontal hierarchy” in data networks.  This requirement is often
   presented in the context of layer 2 and layer 3 VPN services where
   SLAs would appear to necessitate signaling from the edges into the
   core of a network.  Issues include potential limitations of current
   protocols in networks that are hierarchical (e.g., multi-area
   OSPF) and scalability concerns over potentially O(N^2) connection
   growth in larger networks.

   Please send comments to te-wg@ops.ietf.org


2. Conventions used in this document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED",  "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in RFC-2119 [2].


3. Introduction

   This document presents a proposal of the tangible requirements for
   network survivability and hierarchy in current service provider
   environments.  With feedback from the working group solicited, the
   objective is to help focus the work that is being addressed in the
   traffic engineering, CCAMP, and other working groups.  A main goal of
   this work is to expedite the availability of required functionality
   in multi-vendor service provider networks.  The initial focus is
   primarily on intra-domain operations.  However, to maintain
   consistency in the provision of end-to-end service in a multi-
   provider environment, rules governing the operations of
   survivability mechanisms at domain boundaries must also be
   specified.  While such issues are raised and discussed, where
   appropriate, they will not be treated in depth in the initial
   release of this document.

   The document first develops a set of definitions to be used later in
   this document and potentially in other documents as well.  It then
   addresses the requirements and issues associated with service
   restoration and hierarchy, and closes with a short discussion of
   survivability in a hierarchical context.


4. Definitions

4.1 Hierarchy


   Network hierarchy is an abstraction of part of a network's topology
   and the routing and signaling mechanism needed to support the
   topological abstraction.  Abstraction may be used as a mechanism to
   build large networks or as a technique for enforcing administrative,
   topological or geographic boundaries.  For example, network
   hierarchy might be used to separate the metropolitan and long-haul
   regions of a network or to separate the regional and backbone
   sections of a network [Bert Wijnen], or to interconnect service
   provider networks (with BGP which reduces a network to an Autonomous
   System).  In this document, network hierarchy is considered from two
   perspectives:
   (1) Horizontally oriented: between two areas or administrative
   subdivisions within the same network layer
   (2) Vertically oriented: between two network layers

   Horizontal hierarchy is the abstraction necessary to allow a network
   at one network layer, for instance a packet network, to grow.
   Examples of horizontal hierarchy include BGP and multi-area OSPF.

   Vertical hierarchy is the abstraction, or reduction in information,
   which would be of benefit when communicating information across
   network layers, as in propagating information between optical and
   router networks.


4.2 Survivability


   Extra traffic is the traffic carried over the protection entity
   while the working entity is active.  Extra traffic is not protected,
   i.e., when the protection entity is required to protect the traffic
   that is being carried over the working entity (e.g., due to a
   failure affecting the working entity), the extra traffic is
   preempted.

   Normalization is the return to the normal state of a network upon
   completing the repair of the network failure.  This could include
   the rerouting of affected traffic to the original working entities
   or new routes.  The term revertive mode is used when traffic is
   returned to the working entity (switch back).

   Protection, also called protection switching, is a survivability
   technique based on predetermined failure recovery: as the working
   entity is established, resources are reserved for the protection
   entity.  These resources may be used by low-priority traffic
   (referred to as extra traffic) if traffic preemption is allowed.
   Depending on the amount of reserved resources, not all of the
   affected traffic may be protected.  (For further discussion of
   concepts related to protection, see the Sub-section below on
   Survivability Concepts.)

   Protection entity (also called back-up entity or recovery entity) is
   the entity that is used to carry protected traffic in protection
   operation mode, i.e., when the working entity is in error or has
   failed.

   Recovery is the sequence of actions taken by a network after the
   detection of a failure to maintain the required performance level
   for existing services (e.g., according to service level agreements)
   and to allow normalization of the network.  The actions include
   notification of the failure followed by two parallel processes: (1)
   a repair process with fault isolation and repair of the failed
   components, and (2) a reconfiguration process with path selection
   and rerouting for the affected traffic.

   Rerouting is the transfer of affected traffic from the working entity
   to the protection entity, when the path for the protection entity
   has been selected after the detection of a fault on the working
   entity.  This is synonymous with switch-over in protection
   techniques.  (In [3], rerouting is synonymous with restoration.)

   Restoration is a survivability technique that dynamically discovers
   an alternate path from spare resources in the network, or
   establishes new paths on demand, for affected traffic once the
   failure is detected and the affected traffic is identified for
   rerouting.  The new path may be based on preplanned configurations
   or current network status.  Thus, restoration involves a path
   selection process followed by traffic rerouting.  (In [3],
   restoration is referred to as recovery by rerouting.)

   Restoration, or more specifically, service restoration, refers to
   the actions taken by a network to maintain service continuity after
   the detection of a failure.  In this second usage, restoration has a
   meaning very similar to recovery, except that restoration covers
   only the reconfiguration process and not the repair process.  Also,
   in this usage, it should be clear from the context that it is
   irrelevant whether the survivability technique used to achieve
   service continuity is based on protection or restoration techniques.

   Restoration time is the time interval from the occurrence of a
   network impairment to the instant when either the affected traffic
   is completely rerouted, or spare resources are exhausted and/or no
   more preemptable traffic remains to make room.

   Revertive mode is a procedure in which revertive action, i.e.,
   switch back from the protection entity to the working entity, is
   taken once the failed working entity has been repaired.  In non-
   revertive mode, such action is not taken.  To minimize service
   interruption, switch-back in revertive mode should be performed at a
   time when there is the least impact on the traffic concerned, or by
   using the make-before-break concept.

   Shared risk group (SRG) is a set of network elements with a shared
   vulnerability, as defined by a network operator.  For example, a
   shared risk link group (SRLG) is the union of all the links on those
   fibers that are routed in the same physical conduit in a fiber-span
   network.  This concept includes, besides shared conduit, other types
   of compromise such as shared fiber cable, shared right of way,
   shared optical ring, shared office without power sharing, etc.
   Also, the extent of compromise in a shared vulnerability, such as
   the length of the sharing for compromised outside plant, needs to be
   considered.
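
   As a rough illustration of the SRLG example above, the following
   Python sketch groups links by the conduits they share.  The
   inventory format, the link and conduit names, and the function name
   build_srlgs are assumptions made for this example only.

```python
from collections import defaultdict


def build_srlgs(link_to_conduits):
    """Group links into shared risk link groups, one group per conduit.

    link_to_conduits maps a link name to the set of conduits its fiber
    traverses (hypothetical operator inventory data).
    """
    srlgs = defaultdict(set)
    for link, conduits in link_to_conduits.items():
        for conduit in conduits:
            srlgs[conduit].add(link)
    # A conduit carrying a single link is a risk, but not a shared one.
    return {c: links for c, links in srlgs.items() if len(links) > 1}


inventory = {
    "A-B": {"conduit-1"},
    "A-C": {"conduit-1", "conduit-2"},
    "C-D": {"conduit-2"},
    "B-D": {"conduit-3"},
}
srlgs = build_srlgs(inventory)
```

   Other types of compromise (shared cable, right of way, office)
   could be handled the same way by keying the mapping on those
   resources instead of conduits.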

   Survivability is the capability of a network to maintain service
   continuity in the presence of faults within the network [4].
   Survivability techniques such as protection and restoration are
   implemented either on a per-link basis, on a per-path basis, or
   throughout an entire network to alleviate service disruption at
   affordable costs.  The degree of survivability is determined by the
   network's capability to survive single failures, multiple failures,
   and equipment failures.

   Working entity is the entity that is used to carry traffic in normal
   operation mode.  Depending on the context, an entity can be, e.g., a
   channel or a transmission link in the physical layer, an LSP in
   MPLS, or a logical bundle of one or more LSPs.

4.3 Survivability Concepts


   In a survivable network design, spare capacity and diversity are
   built into the network from the beginning to support some degree of
   self-healing whenever failures occur.  A common strategy is to
   associate each working entity with a protection entity having either
   dedicated pre-reserved resources or shared resources that are pre-
   reserved or reserved-on-demand.  Approaches to providing
   survivability can be classified according to the method used to set
   up the protection entity.  Generally, protection techniques
   are based on having a dedicated protection entity set up prior to
   failure.  Such is not the case in restoration techniques, which
   mainly rely on the use of spare capacity in the network.  Hence, in
   terms of trade-offs, protection techniques usually offer fast
   recovery from failure with enhanced availability, while restoration
   techniques usually achieve better resource utilization.

   Protection techniques can be implemented by several architectures:
   1+1, 1:1, 1:n, and m:n.  In the context of SDH/SONET, they are
   referred to as Automatic Protection Switching (APS).

   In the 1+1 protection architecture, a protection entity is dedicated
   to each working entity.  The dual-feed mechanism is used whereby the
   working entity is permanently bridged onto the protection entity at
   the source of the protected domain.  In normal operation mode,
   identical traffic is transmitted simultaneously on both the working
   and protection entities.  At the sink of the protected domain, both
   feeds are monitored for alarms and maintenance signals.  A selection
   between the working and protection entity is made based on some
   predetermined criteria, such as the transmission performance
   requirements or defect indication.  This architecture is rather
   expensive since resource duplication is required.  It is generally
   used for specific services that need a very high availability.

   In the 1:1 protection architecture, a protection entity is also
   dedicated to each working entity.  The protected traffic is normally
   transmitted by the working entity.  If the working entity has
   failed, the protected traffic is rerouted to the protection entity.
   This architecture is inherently slower in recovering from failure
   than a 1+1 architecture since communication between both ends of the
   protection domain is required to perform the switch-over operation.
   An advantage is that the protection entity can optionally be used to
   carry preemptable "extra traffic" in normal operation.

   In the 1:n protection architecture, a dedicated protection entity is
   shared by n working entities.  Traffic is normally sent on the
   working entities.  When multiple working entities have failed
   simultaneously, only one of them can be restored by the common
   protection entity.  This contention is resolved by assigning a
   different preemptive priority to each working entity.  As in the 1:1
   case, the protection entity can optionally be used to carry
   preemptable "extra traffic" in normal operation.

   The m:n architecture is a generalization of the 1:n architecture:
   m dedicated protection entities are shared by n working entities,
   where typically m <= n.  While this architecture can improve system
   availability with small cost increases, it has rarely been
   implemented or standardized.
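
   The contention rule described for the 1:n and m:n architectures can
   be sketched in a few lines of Python.  This is only an illustration:
   the function name, the tuple layout, and the convention that a
   larger number means a higher preemptive priority are assumptions,
   not part of any APS specification.

```python
def assign_protection(failed, m):
    """Select which failed working entities get a protection entity.

    failed: list of (name, priority) pairs; a larger number means a
            higher preemptive priority (assumed convention).
    m:      number of protection entities in the m:n group.
    Returns (protected, unprotected) lists of names.
    """
    ranked = sorted(failed, key=lambda entry: entry[1], reverse=True)
    protected = [name for name, _ in ranked[:m]]
    unprotected = [name for name, _ in ranked[m:]]
    return protected, unprotected


# 1:n case: two working entities fail at once, one protection entity.
protected, unprotected = assign_protection([("w3", 1), ("w1", 7)], m=1)
```

   With m=1 only the highest-priority failed entity is switched to the
   protection entity; raising m lets more simultaneous failures be
   covered.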


5. Survivability


5.1 Scope

   Interoperable approaches to network survivability were determined to
   be an immediate requirement in packet networks as well as in
   SDH/SONET framed TDM networks.  Not as pressing at this time were
   techniques that would cover all-optical networks (e.g., where
   framing is unknown), as the control of these networks in a multi-
   vendor environment appeared to face other hurdles to be dealt with
   first.  Also, not of immediate interest were approaches to coordinate
   or explicitly communicate survivability mechanisms across network
   layers (such as from a TDM or optical network to/from an IP
   network).  However, a capability should be provided for a network
   operator to control the operation of survivability mechanisms among
   different layers.  Such issues and those related to OAM are outside
   the scope of this document.  (For proposed MPLS OAM requirements,
   see [5].)

   The types of network failures that cause a restoration to be
   performed include link/span and node failures (which might include
   span failures at lower layers).  Other more complex failure
   mechanisms such as systematic control-plane failure or breach of
   security are not within the scope of the survivability mechanisms
   discussed in this document.

5.2 Required initial set of survivability mechanisms


5.2.1   1:1 Path Protection with Pre-Established Capacity

   In this protection mode, the head end of a working connection also
   establishes a protection connection to the destination.  In
   normal operation, traffic is only sent on the working connection,
   though the ability to signal that traffic will be sent on both
   connections (1+1 Path for signaling purposes) would be valuable in
   non-packet networks.  Some distinction between working and
   protection connections is likely, either through explicit objects,
   or preferably through implicit methods such as general classes or
   priorities.  Head ends need the ability to create connections that
   are as failure disjoint as possible from each other.  This would
   require SRG information that can be generally assigned to either
   nodes or links and propagated through the control or management
   plane.  In this mechanism, capacity in the protection connection is
   pre-established; however, it can be used to carry preemptable extra
   traffic.  Protect capacity is first-come, first-served.  When protect
   capacity is called into service during restoration, there should be
   the ability to promote the protection connection to working status
   (for non-revertive mode operation) with some form of make-before-
   break capability.
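
   One way a head end might use SRG information to pick a protection
   connection that is as failure disjoint as possible is sketched
   below.  The data layout (a mapping from each link or node to the
   SRGs it belongs to) and both function names are assumptions for
   illustration; how SRG data is actually propagated is left to the
   control or management plane.

```python
def shared_srgs(path_a, path_b, srgs_of):
    """Return the SRGs that two paths have in common.

    path_a, path_b: sequences of link (or node) identifiers.
    srgs_of: maps each identifier to the set of SRGs it belongs to.
    """
    risks_a = set().union(*(srgs_of.get(e, set()) for e in path_a))
    risks_b = set().union(*(srgs_of.get(e, set()) for e in path_b))
    return risks_a & risks_b


def pick_protection_path(working, candidates, srgs_of):
    """Pick the candidate sharing the fewest SRGs with the working path."""
    return min(candidates, key=lambda p: len(shared_srgs(working, p, srgs_of)))


srgs_of = {
    "A-B": {"conduit-1"},
    "B-D": {"conduit-3"},
    "A-C": {"conduit-1", "conduit-2"},
    "C-D": {"conduit-2"},
    "A-E": {"conduit-4"},
    "E-D": {"conduit-5"},
}
working = ["A-B", "B-D"]
candidates = [["A-C", "C-D"], ["A-E", "E-D"]]
best = pick_protection_path(working, candidates, srgs_of)
```

   Here the path through C shares conduit-1 with the working path, so
   the path through E is preferred.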

5.2.2   1:1 Path Protection with Pre-Planned Capacity

   Similar to the above 1:1 protection with pre-established capacity,
   the protection connection in this case is also pre-signaled.  The
   difference is in the way protect capacity is assigned.  With pre-
   planned capacity, the mechanism supports the ability for the protect
   capacity to be shared, or “double-booked”.  Should operator-predicted
   failures occur (predictions that could rely on enumeration in
   SRGs), it would be expected that only a limited set of protect
   connections would be put into service, and that the protect
   capacity available in the network would be able to carry this
   traffic (given proper sizing and planning of the network).  In a
   sense, this is 1:1 from a path perspective; however, the protect
   capacity in the network (on a link-by-link basis) is shared in a 1:n
   fashion.  Some form of information propagation could be required
   before traffic may be sent on protection connections, especially in
   TDM networks.  In data networks, a desirable operating approach for
   this mechanism might be one in which the protect capacity is not
   accurately booked against SRGs (e.g., non-predictive).

   The use of this approach may require more careful planning.  Initial
   deployment might first be based on 1:1 path protection with pre-
   established capacity and the local restoration mechanism to be
   described next.
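
   The difference between pre-established and pre-planned protect
   capacity can be made concrete with a small sizing calculation.  The
   Python sketch below is illustrative only: the record layout, the
   function name, and the single-SRG-failure planning assumption are
   choices made for this example.

```python
from collections import defaultdict


def protect_capacity(connections, shared=True):
    """Compute the protect capacity needed on each link.

    connections: list of dicts with keys
        "bw"            bandwidth of the connection,
        "working_srgs"  SRGs traversed by the working path,
        "protect_links" links used by the protection path.
    shared=False sums every booking (pre-established capacity).
    shared=True sizes each link for the worst single-SRG failure
    (pre-planned, "double-booked" capacity).
    """
    if not shared:
        total = defaultdict(int)
        for c in connections:
            for link in c["protect_links"]:
                total[link] += c["bw"]
        return dict(total)

    # demand[link][srg] = bandwidth switched onto link if srg fails
    demand = defaultdict(lambda: defaultdict(int))
    for c in connections:
        for srg in c["working_srgs"]:
            for link in c["protect_links"]:
                demand[link][srg] += c["bw"]
    return {link: max(per_srg.values()) for link, per_srg in demand.items()}


conns = [
    {"bw": 10, "working_srgs": {"s1"}, "protect_links": {"X-Y"}},
    {"bw": 5, "working_srgs": {"s2"}, "protect_links": {"X-Y"}},
    {"bw": 4, "working_srgs": {"s1"}, "protect_links": {"X-Y", "Y-Z"}},
]
dedicated = protect_capacity(conns, shared=False)
pre_planned = protect_capacity(conns, shared=True)
```

   On link X-Y the dedicated booking is 19 units, while sharing needs
   only 14, because the connections exposed to s1 and to s2 are not
   expected to switch at the same time.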

5.2.3   Local Restoration

   Due to the time impact of signal propagation, path-based approaches
   may not be able to meet the service requirements desired in some
   networks.   The solution to this is to restore connectivity in
   immediate proximity to the fault.  At a minimum, this approach
   should be able to protect against connectivity-type SRGs, though
   protecting against node-based SRGs might be worthwhile.  After local
   restoration is in place, it is likely that head end systems would
   later perform some path-level re-grooming.  Head end systems must
   have some control as to whether their connections are candidates for
   or excluded from local restoration.

5.2.4   Path Restoration

   In this approach, connections that are impacted by a fault are
   rerouted by the originating network element upon notification of
   connection failure.  This approach does not involve any new
   mechanisms.  It is mentioned merely as another common approach to
   protecting against faults in a network.

5.3 Applications Supported

   With service continuity under failure as a goal, a network is
   "survivable" if, in the face of a network failure, connectivity is
   interrupted for a brief period and then restored before the network
   failure ends.  The length of this interrupted period is dependent on
   the application supported.  Here are some typical applications that
   need to be considered:

   - Best-effort data: restoration of network connectivity by rerouting
     at the IP layer would be sufficient
   - Premium data service: need to meet TCP or application protocol
     timer requirements
   - Voice: call cutoff is in the range of 140 msec to 2 sec
   - Other real-time service (e.g., streaming, fax)
   - Mission-critical applications

5.4 Timing Bounds for Service Restoration

   The approach to picking the types of survivability mechanisms
   recommended was to consider a spectrum of mechanisms that can be
   used to protect traffic with varying characteristics of
   survivability and speed of restoration, and then attempt to select a
   few general points which provide some coverage across that spectrum.
   The focus of this work is to provide requirements to which a small
   set of detailed proposals may be developed, allowing the operator
   some (limited) flexibility in approaches to meeting their design
   goals in engineering multi-vendor networks.  Requirements of
   different applications as listed in the previous sub-section were
   discussed generally; however, no one on the team would likely attest
   to the scientific merit of the ability of the timing bounds below to
   meet any specific application’s needs.  A few assumptions include:

   - Approaches that protection switch without propagation of
     information are likely to be faster than those that do require
     some form of fault notification to some or all elements in a
     network.
   - Approaches that require some form of signaling after a fault will
     also likely suffer some timing impact.

   Proposed timing bounds for service restoration for different
   mechanisms are as follows (all bounds are exclusive of signal
   propagation):

   1:1 path protection with pre-established capacity:   100-500 ms
   1:1 path protection with pre-planned capacity:       100-750 ms
   Local restoration:                                   50 ms
   Path restoration:                                    1-5 seconds

   To ensure that the service requirements for different applications
   can be met within the above timing bounds, restoration priority is
   used to determine the order in which connections are restored (to
   minimize service restoration time as well as to gain access to
   available spare capacity).  Preemption priority should only be used
   in the event that all connections cannot be restored, in which case
   connections with lower preemption priority should be released.
   Depending on a service provider's strategy in provisioning network
   resources for backup, preemption may or may not be needed in the
   network.
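
   The interplay of restoration priority and preemption priority
   described above might look as follows in Python.  This is a sketch
   under stated assumptions: a single pool of spare capacity, larger
   numbers meaning higher priority for both attributes, and a
   hypothetical function name and tuple layout.

```python
def plan_restoration(failed, spare_capacity, preemptable):
    """Order restoration and decide which connections to preempt.

    failed:      list of (name, bandwidth, restoration_priority);
                 higher priority is restored earlier.
    spare_capacity: spare bandwidth available before any preemption.
    preemptable: list of (name, bandwidth, preemption_priority);
                 lower priority is released first.
    Returns (restored, preempted, unrestored) lists of names.
    """
    restored, preempted, unrestored = [], [], []
    victims = sorted(preemptable, key=lambda c: c[2])
    for name, bw, _ in sorted(failed, key=lambda c: c[2], reverse=True):
        # Skip connections that cannot fit even after all preemption.
        if spare_capacity + sum(v[1] for v in victims) < bw:
            unrestored.append(name)
            continue
        # Preempt only when spare capacity alone is insufficient.
        while spare_capacity < bw:
            v_name, v_bw, _ = victims.pop(0)
            preempted.append(v_name)
            spare_capacity += v_bw
        spare_capacity -= bw
        restored.append(name)
    return restored, preempted, unrestored


result = plan_restoration(
    failed=[("voice", 4, 9), ("bulk", 8, 1), ("vpn", 5, 5)],
    spare_capacity=6,
    preemptable=[("extra-1", 3, 0), ("extra-2", 2, 1)],
)
```

   In this example "voice" is restored from spare capacity alone,
   "vpn" requires releasing "extra-1", and "bulk" cannot be restored
   even if the remaining preemptable traffic were released.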

5.5 Coordination Among Layers

   A common design goal for multi-layered networks is to provide the
   desired level of service in the most cost effective manner.  The use
   of multilayer survivability might allow the optimization of spare
   resources through the improvement of resource utilization by sharing
   spare capacity across different layers, though further
   investigations are needed.  Coordination during service restoration
   among different network layers (e.g. IP, SDH/SONET, optical layer)
   might necessitate development of vertical hierarchy.  The benefits
   of providing survivability mechanisms at multiple layers, and the
   optimization of the overall approach, must be weighed against the
   associated cost and service impacts.

   It was felt that the current approaches to coordinating
   survivability mechanisms did not have significant operational
   shortfalls.  These approaches include protecting traffic solely at
   one layer (e.g., at the IP layer over linear WDM, or at the
   SDH/SONET layer).  Where survivability mechanisms might be deployed
   at several layers, such as when a routed network rides an SDH/SONET
   protected network, it was felt that current coordination approaches
   were sufficient.  However, note that failures within a layer can be
   guarded against by techniques either in that layer or at a higher
   layer, but not in reverse.  Thus, the optical layer cannot guard
   against failures in the IP layer such as router system failures or
   line card failures.

   A default coordination mechanism for inter-layer interaction could
   be the use of nested timers and current SDH/SONET fault monitoring,
   as has been done traditionally for backward compatibility.  Thus,
   when lower-layer restoration happens in a longer time period than
   higher-layer restoration, a hold-off timer is utilized to avoid
   contention between the different single-layer recovery schemes.  In
   other words, multi-layer interaction is addressed by having
   successively higher multiplexing levels operate at a restoration
   time scale greater than that of the next lower layer.  Currently,
   if SONET
   protection switching is used, MPLS recovery timers must wait until
   SONET has had time to switch.
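
   The nested-timer idea can be sketched in Python as follows.  The
   function names, the optional guard time, and the numeric values
   used in the example are assumptions for illustration; they are not
   taken from any SDH/SONET or MPLS specification.

```python
def upper_layer_should_act(fault_duration_ms, lower_layer_bound_ms, guard_ms=0):
    """Decide whether the higher layer starts its own recovery.

    The higher layer waits for a hold-off period covering the lower
    layer's restoration bound (plus an optional guard time) and acts
    only if the fault is still present when the timer expires.
    """
    hold_off_ms = lower_layer_bound_ms + guard_ms
    return fault_duration_ms > hold_off_ms


def nested_hold_offs(layer_bounds_ms):
    """Given per-layer restoration bounds ordered lowest layer first,
    return the hold-off each layer applies: the sum of the bounds of
    all layers below it (zero for the lowest layer)."""
    hold_offs, elapsed = [], 0
    for bound in layer_bounds_ms:
        hold_offs.append(elapsed)
        elapsed += bound
    return hold_offs


# Assumed bounds: 50 ms at the lowest layer, then 500 ms, then 5 s.
hold_offs = nested_hold_offs([50, 500, 5000])
```

   A fault cleared by the lower layer within its bound never triggers
   the higher layer, which avoids contention between the two recovery
   schemes at the price of a longer outage when the lower layer fails
   to recover.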

5.6 Evolution Toward IP Over Optical

   As more pressing requirements for survivability and horizontal
   hierarchy for edge-to-edge signaling are met with technical
   proposals, it is believed that the benefits of merging (in some
   manner) the control planes of multiple layers will be outlined.
   When these benefits are self-evident, it would then seem to be the
   right time to review if vertical hierarchy mechanisms are needed,
   and what the requirements might be.

6. Hierarchy Requirements

   Efforts in the area of network hierarchy should focus on mechanisms
   that would allow more scalable edge-to-edge signaling, or signaling
   across networks with existing network hierarchy (such as multi-area
   OSPF).  This would appear to be a more immediate need than
   mechanisms that might be needed to interconnect networks at
   different layers.

6.1 Historical Context

   One reason for horizontal hierarchy is functionality (e.g., metro
   versus backbone).  Geographic “islands” reduce the need for
   interoperability and make administration and operations less
   complex.  Using a simpler, more interoperable, survivability scheme
   at metro/backbone boundaries is natural for many provider network
   architectures.  In transmission networks, creating geographic
   islands of different vendor equipment has been done for a long time
   because multi-vendor interoperability has been difficult to achieve.
   Traditionally, providers have had to coordinate the equipment on
   either end of a "connection," and making this interoperable reduces
   complexity.  A provider should be able to concatenate survivability
   mechanisms in order to provide a "protected link" to the next higher
   level.  Think of SDH/SONET rings connecting to TDM DXCs with 1+1
   line-layer protection between the ADM and the DXC port.  The TDM
   connection, e.g., a DS3, is protected, but usually all equipment on
   each SDH/SONET ring is from a single vendor.  The DXC cross
   connections are controlled by the provider and the ports are
   physically protected, resulting in a highly available design.  Thus,
   concatenation of survivability approaches can be used to cascade
   across horizontal hierarchy.  While not perfect, it is workable.

   While the problems associated with multi-vendor interoperability may
   necessitate horizontal hierarchy as a practical matter (at least
   this has been the case in TDM networks), there may be no technical
   reason for it.  Members of the team with more experience on IP
   networks felt there should be no need for this in core networks, or
   even most access networks.



   Some of the largest service provider networks currently run a single
   area/level IGP.  Some service providers, as well as many large
   enterprise networks, run multi-area OSPF to gain increases in
   scalability.  Often, this was part of the original design, so it is
   difficult to say if the network truly required the hierarchy to
   reach its current size.

   Some proposals on improved mechanisms to address network hierarchy
   have been suggested [6, 7, 8].  This document aims to provide the
   concrete requirements so that these and other proposals can first
   aim to meet some limited objectives.

6.2 Applications for Horizontal Hierarchy

   A primary driver for intra-domain horizontal hierarchy is signaling
   scalability in the context of edge-to-edge VPNs, potentially across
   traffic-engineered data networks.  There are a number of different
   approaches to VPNs and they are currently being addressed by
   different emerging protocols: RFC 2547bis BGP/MPLS VPNs, provider-
   provisioned VPNs based upon MPLS tunnels (e.g., virtual routers),
   Pseudo Wire Emulation Edge-to-Edge (PWE3), etc.  These may or may not
   need explicit signaling from edge to edge, but it is a common
   perception that in order to meet SLAs, some form of edge-to-edge
   signaling is required.

   For signaling scalability, there are probably two types of network
   scenarios to consider:

   - Large SP networks with flat routing domains where edge-to-edge
     (MPLS) signaling as implemented today would probably not scale.
   - Networks which would like to signal edge-to-edge, and might even
     scale in a limited application.  However, they are hierarchically
     routed (e.g., OSPF areas), and current implementations, and
     potentially standards, prevent signaling across areas.  This
     requires the development of signaling standards that support
     dynamic establishment and potentially restoration of LSPs across a
     2-level IGP hierarchy.

   Scalability is concerned with the O(N^2) properties of edge-to-edge
   signaling, where N is the number of edge devices.  For a large
   network, maintaining a "connection" between every pair of edge
   devices is simply not scalable.  Even if establishing and
   maintaining connections is feasible, there might be an impact on
   core survivability mechanisms which would cause restoration times to
   grow with N^2, which would be undesirable.  While some value of N
   may be inevitable, approaches to reduce N (e.g., to pull in from the
   edge to aggregation points) might be of value.
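
   As a rough illustration of the O(N^2) growth discussed above, the
   following Python sketch counts the edge-to-edge "connections" needed
   for a full mesh among N devices, and compares a mesh among edge
   devices with a mesh among a smaller number of aggregation points.
   The function name and the device counts are assumptions chosen for
   illustration only.  The count is of unordered pairs; if connections
   are unidirectional, the numbers double.

```python
def full_mesh_connections(n: int) -> int:
    """Number of unordered pairs among n devices: n * (n - 1) / 2."""
    if n < 0:
        raise ValueError("device count must be non-negative")
    return n * (n - 1) // 2


# Hypothetical network sizes: (edge devices, aggregation points).
scenarios = [(100, 10), (1000, 50), (5000, 100)]

for edges, agg_points in scenarios:
    at_edge = full_mesh_connections(edges)
    at_agg = full_mesh_connections(agg_points)
    print(f"{edges:5d} edge devices: {at_edge:9d} connections; "
          f"{agg_points:4d} aggregation points: {at_agg:6d} connections")
```

   In this toy model, meshing 50 aggregation points instead of 1000
   edge devices reduces the pair count from 499500 to 1225.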

   For routing scalability, especially in data applications, a major
   concern is the amount of processing/state that is required in the
   variety of network elements.  If some nodes might not be able to
   communicate and process the state of every other node, it might be
   preferable to limit the information.  One school of thought holds
   that the amount of information contained by a horizontal barrier
   should be significant, and that the impact this might have on
   optimality in route selection and on the ability to provide global
   survivability is an accepted tradeoff.
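
   To make the state tradeoff more concrete, the sketch below compares
   the number of routing entries a router holds in a flat domain with
   the number held in a two-level hierarchy.  The model is deliberately
   simplified and is an assumption of this example, not a description
   of any particular IGP: areas are of equal size, a router holds one
   entry per router in its own area, and one summary entry per remote
   area.  All function names are hypothetical.

```python
from math import ceil


def flat_entries(n_routers: int) -> int:
    """Flat domain: every router holds state for every router."""
    return n_routers


def hierarchical_entries(n_routers: int, n_areas: int) -> int:
    """Two-level model: full state for the local area plus one
    summary entry for each of the other areas (an assumption)."""
    if n_areas < 1:
        raise ValueError("need at least one area")
    local = ceil(n_routers / n_areas)
    return local + (n_areas - 1)


for areas in (1, 5, 10, 20):
    print(f"{areas:2d} areas: {hierarchical_entries(1000, areas):4d} entries "
          f"(flat: {flat_entries(1000)})")
```

   The reduction in state comes from hiding detail behind the boundary,
   which is also what limits optimal route selection and global
   survivability.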

6.3 Horizontal Hierarchy Requirements

   Mechanisms are required to allow for edge-to-edge signaling of
   connections through a network.  The types of network scenarios
   include large networks with a large number of edge devices and flat
   interior routing, as well as medium to large networks which
   currently have hierarchical interior routing such as multi-area OSPF
   or multi-level IS-IS.  The primary context of this is edge-to-edge
   signaling, which is thought to be required to assure the SLAs for
   the layer 2 and layer 3 VPNs that are being carried across the
   network.  Another possible context would be edge-to-edge signaling
   in TDM SDH/SONET networks, where metro and core networks again
   might be in either a flat or a hierarchical interior routing
   domain.

7. Survivability and Hierarchy

   When horizontal hierarchy exists in a network layer, a question
   arises as to how survivability can be provided along a connection
   which crosses hierarchical boundaries.

   In designing protocols to meet the requirements of hierarchy, an
   approach to consider is that boundaries are either clean, or are of
   minimal value.  However, the concept of network elements that
   participate on both sides of a boundary might be a consideration
   (e.g., OSPF ABRs).  That would allow for devices on either side to
   take an intra-area approach within their region of knowledge, and
   for the ABR to do this in both areas, and splice the two protected
   connections together at a common point (granted, it is now a common
   point of failure).  If the limitations of this approach start to
   appear in operational settings, then perhaps it would be time to
   start thinking about route servers and signaling-propagated
   directives.

   One initial approach might be to signal through a common border
   router, and to consider the service as protected because it
   consists of a concatenated set of connections which are each
   protected within their area.  Another approach might be to have a
   least common denominator mechanism at the boundary, e.g., 1+1 port
   protection.

   There should also be some standardized means for a survivability
   scheme on one side of such a boundary to communicate with the
   scheme on the other side regarding the success or failure of the
   service restoration action.  For example, if a part of a
   "connection" is down on one side of such a boundary, there is no
   need for the other side to recover from failures.

   In summary, at this time, approaches that allow concatenation of
   survivability schemes across hierarchical boundaries should prove
   sufficient.
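
   The concatenation approach can be sketched as follows.  This is a
   minimal model under stated assumptions, not a protocol design: each
   area contributes one segment with a working path and a disjoint
   protection path, segments are spliced at a border router, and a
   failure is represented simply as a set of failed element names.
   The class, function, and router names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProtectedSegment:
    """One intra-area piece of the service with two disjoint paths."""
    area: str
    working: frozenset  # elements (nodes or links) on the working path
    protect: frozenset  # elements on the protection path

    def survives(self, failed: frozenset) -> bool:
        # The segment is up if at least one path has no failed element.
        return not (self.working & failed) or not (self.protect & failed)


def service_survives(segments, border_routers, failed) -> bool:
    """The concatenated service is up only if every border router is up
    and every segment can recover within its own area."""
    failed = frozenset(failed)
    if frozenset(border_routers) & failed:
        return False
    return all(seg.survives(failed) for seg in segments)


# Hypothetical topology: two areas spliced at a single ABR.
area1 = ProtectedSegment("area 1", frozenset({"R1", "R2"}),
                         frozenset({"R3", "R4"}))
backbone = ProtectedSegment("area 0", frozenset({"R5"}), frozenset({"R6"}))
segments = [area1, backbone]
abrs = {"ABR1"}

print(service_survives(segments, abrs, {"R2"}))        # one failure in area 1
print(service_survives(segments, abrs, {"R2", "R5"}))  # one failure per area
print(service_survives(segments, abrs, {"R2", "R3"}))  # both paths in area 1
print(service_survives(segments, abrs, {"ABR1"}))      # the splice point
```

   The first two calls print True, since each area recovers locally;
   the last two print False, the final one showing the border router
   as a common point of failure.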


8. Security Considerations

   Security is not considered in this initial version.


9. References


   1  Bradner, S., "The Internet Standards Process -- Revision 3", BCP
      9, RFC 2026, October 1996.

   2  Bradner, S., "Key words for use in RFCs to Indicate Requirement
      Levels", BCP 14, RFC 2119, March 1997.

   3  V. Sharma, B. Crane, K. Owens, C. Huang, F. Hellstrand, J. Weil,
      L. Andersson, B. Jamoussi, B. Cain, S. Civanlar, and A. Chiu,
      "Framework for MPLS-based Recovery," Internet-Draft, Work in
      Progress, March 2001.

   4  D.O. Awduche, A. Chiu, A. Elwalid, I. Widjaja, and X. Xiao, "A
      Framework for Internet Traffic Engineering," Internet-Draft, Work
      in Progress, May 2001.

   5  N. Harrison, et al, "Requirements for OAM in MPLS Networks,"
      Internet-Draft, Work in Progress, May 2001.

   6  K. Kompella and Y. Rekhter, "Multi-area MPLS Traffic
      Engineering," Internet-Draft, Work in Progress, March 2001.

   7  G. Ash, et al, "Requirements for Multi-Area TE," Internet-Draft,
      Work in Progress, March 2001.

   8  A. Iwata, N. Fujita, G.R. Ash, and A. Farrel, "Crankback Routing
      Extensions for MPLS Signaling," Internet-Draft, Work in Progress,
      July 2001.


10.  Acknowledgments

   Much of the direction taken in this document, and by the team, was
   steered by the insightful questions provided by Bala Rajagopalan,
   Greg Bernstein, Yangguang Xu, and Avri Doria.  The set of questions
   is attached as Appendix A in this document.


11. Author's Addresses

   Wai Sum Lai
   AT&T
   200 Laurel Avenue
   Middletown, NJ 07748, USA
   Tel: +1 732-420-3712
   wlai@att.com


   Dave McDysan

   Jim Boyle
   jimpb@nc.rr.com

   Malin Carlzon

   Rob Coltun

   Tim Griffin

   Ed Kern
   Cogent Communications
   3413 Metzerott Rd
   College Park, MD 20740, USA
   Tel: +1 703-852-0522
   ejk@tech.org

   Tom Reddington


Appendix A: Questions used to help develop requirements

   A. Definitions

   1. In determining the specific requirements, the design team should
   precisely define the concepts "survivability", "restoration",
   "protection", "protection switching", "recovery", "re-routing",
   etc., and their relations.  This would enable the requirements
   document to describe precisely which of these will be addressed.
   In the following, the term "restoration" is used to indicate the
   broad set of policies and mechanisms used to ensure survivability.

   B. Network types and protection modes

   1. What is the scope of the requirements with regard to the types of
   networks covered? Specifically, are the following in scope:

   - Restoration of connections in mesh optical networks (opaque or
     transparent)
   - Restoration of connections in hybrid mesh-ring networks
   - Restoration of LSPs in MPLS networks (composed of LSRs overlaid on
     a transport network, e.g., optical)
   - Any other types of networks?

   Is commonality of approach, or optimization of approach more
   important?

   2.  What are the requirements with regard to the protection modes to
   be supported in each network type covered? (Examples of protection
   modes include 1+1, M:N, shared mesh, UPSR, BLSR, newly defined modes
   such as P-cycles, etc.)



   3.  What are the requirements on local span (i.e., link by link)
   protection and end-to-end protection, and the interaction between
   them?  For example, what should be the granularity of connections
   for each type (single connection, bundle of connections, etc.)?

   C. Hierarchy

   1. Vertical (between two network layers):
       What are the requirements for the interaction between
   restoration procedures across two network layers, when these
   features are offered in both layers?  (For example, an MPLS network
   realized over pt-to-pt optical connections.)  In such a case:

       (a) Are there any criteria to choose which layer should provide
   protection?

       (b) If both layers provide survivability features, what are the
   requirements to coordinate these mechanisms?

       (c) How is the current lack of cross-layer coordination
   functionality hampering operations?

       (d) Would the benefits be worth additional complexity associated
   with routing isolation (e.g. VPN, areas), security, address
   isolation and policy / authentication processes?

   2. Horizontal (between two areas or administrative subdivisions
   within the same network layer):

       (a) What are the criteria that trigger the creation of protocol
   or administrative boundaries pertaining to restoration (e.g.,
   scalability, multi-vendor interoperability, multi-provider
   operation)?  What are the practical issues?  Should multi-vendor
   operation necessitate hierarchical separation?

       When such boundaries are defined:

       (b) What are the requirements on how protection/restoration is
   performed end-to-end across such boundaries?

       (c) If different restoration mechanisms are implemented on two
   sides of a boundary, what are the requirements on their interaction?

      What is the primary driver of horizontal hierarchy? (select one)
       - functionality (e.g., metro vs. backbone)
       - routing scalability
       - signaling scalability
       - current network architecture, trying to layer TE on top of an
         already hierarchical network architecture
       - routing and signaling

      For signaling scalability, is it
       - manageability
       - processing/state of network
       - edge-to-edge N^2 type issue

      For routing scalability, is it
       - processing/state of network
       - are you flat and want to go hierarchical
       - or already hierarchical?
       - data or TDM application?

   D. Policy

   1. What are the requirements for policy support during
   protection/restoration, e.g., restoration priority, preemption,
   etc.?

   E. Signaling Mechanisms

   1. What are the requirements on the signaling transport mechanism
   (e.g., in-band over SONET/SDH overhead bytes, out-of-band over an IP
   network, etc.) used to communicate restoration protocol messages
   between network elements?  What are the bandwidth and other
   requirements on the signaling channels?

   2. What are the requirements on fault detection/localization
   mechanisms (which are the prelude to performing restoration
   procedures) in the case of opaque and transparent optical networks?
   What are the requirements in the case of MPLS restoration?

   3. What are the requirements on signaling protocols to be used in
   restoration procedures (e.g., high priority processing, security,
   etc.)?

   4. Are there any requirements on the operation of restoration
   protocols?

   F. Quantitative

   1. What are the quantitative requirements (e.g., latency) for
   completing restoration under different protection modes (for both
   local and end-to-end protection)?

   G. Management

   1. What information should be measured/maintained by the control
   plane at each network element pertaining to restoration events?

   2. What are the requirements for the correlation between control
   plane and data plane failures from the restoration point of view?


Full Copyright Statement



   Copyright (C) The Internet Society (date). All Rights Reserved.
   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published
   and distributed, in whole or in part, without restriction of any
   kind, provided that the above copyright notice and this paragraph
   are included on all such copies and derivative works. However, this
   document itself may not be modified in any way, such as by removing
   the copyright notice or references to the Internet Society or other
   Internet organizations, except as needed for the purpose of
   developing Internet standards in which case the procedures for
   copyrights defined in the Internet Standards process must be
   followed, or as required to translate it into languages other than
   English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.

   This document and the information contained herein is provided on an
   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.