RE: cleaned up question sheet



Here are my answers to the questions. This was a thought-provoking exercise.
Thanks, Jim, for cleaning up the questions to make this easier.

I think that we need to agree on some more terminology than that mentioned
in A.1. I had to make some definitions to answer the questions, as noted
below. Also, thinking about the questions raised even more questions.
Hopefully, I did not raise more questions than I answered! ;^)

Dave

> -----Original Message-----
> From: owner-tewg-dt@ops.ietf.org [mailto:owner-tewg-dt@ops.ietf.org]On
> Behalf Of Jim Boyle
> Sent: Monday, July 02, 2001 11:57 AM
> To: tewg-dt@ops.ietf.org
> Subject: cleaned up question sheet
>
>
>
> took a pass at cleaning up the question sheet, no new content.
>
> ------* snip here *--------
> A. Definitions
>
> 1. In determining the specific requirements, the design team should
>    precisely define  the concepts "survivability", "restoration",
>    "protection", "protection switching", "recovery", "re-routing"
>    etc. and their relations. This would enable the requirements doc to
>    describe precisely which of these will be addressed.
>
>    In the following, the term "restoration" is used to indicate the broad
>    set of policies and mechanisms used to ensure survivability.

Ron Bonica of WorldCom provided the following definitions. I believe that we
should start with those from Wai Lai's definitions and see if the following
or the definitions from the MPLS recovery framework, as Jim Boyle proposed,
add any meaning or clarification. In my opinion, the MPLS-recovery framework
definitions are not general enough for the scope defined in B.1 and do not
define all of the terms in the above list.

Restoration - A network's ability to re-establish connectivity, without
human intervention, after the failure of a network element or network link.
The period after which connectivity must be restored is subject to
definition by a service level agreement. The degree to which the
restored service may be degraded is also subject to definition by a service
level agreement.

Survivability - The ability to maintain connectivity during network
failures. The degree to which connectivity can be degraded during a network
failure is subject to definition by a service level agreement. Levels of
degradation are defined in terms of throughput, latency and jitter.

The types of failures during which connectivity must be maintained are also
subject to definition by a service level agreement. For example, a service
level agreement might require survivability in the face of a single trunk
failure, but not in the case of a double trunk failure.

A network can be called "survivable" if, in the face of a network failure,
connectivity is interrupted for a brief period and then restored before the
network failure ends. The period during which connectivity can be
interrupted is subject to definition by a service level agreement.

>
>
> B. Network types and protection modes
>
> 1. What is the scope of the requirements with regard to the types
>     of networks covered? Specifically, are the following in scope:
>
>     -  Restoration of connections in mesh optical networks
>        (opaque or transparent)

Yes, here I interpret connection as the optical carrier or the electronic
framing (e.g., SONET/SDH).

>     -  Restoration of connections in hybrid mesh-ring networks

Yes, here I interpret connection as the optical carrier or the electronic
framing (e.g., SONET/SDH).

>     -  Restoration of LSP connections in MPLS networks (composed of LSRs
                            ^^^^^^^^^^
> overlaid on a transport network, e.g., optical)

Absolutely! Is an LSP a type of connection, as proposed above? Do we want to
say "implemented by" instead of "composed of?"

>     -  Any other types of networks?

Yes. What about IP routed traffic?

>     -  Is commonality of approach, or optimization of approach
> more important?

I think that optimization of the overall approach is more important. That
is, restoration at the IP, MPLS LSP, and physical connection level needs to
be at least coordinated with automatic discovery and/or configuration
controls that would allow a service provider to economically optimize
deployment of restoration across a range of network ownership, lease, and
partnership arrangements.

>
> 2.  What are the requirements with regard to
>      the protection modes to be supported in each network type covered?
>      (Examples of protection modes include 1+1, M:N, shared mesh,
>      UPSR, BLSR, newly defined modes such as P-cycles, etc.)

I think that this question should be restructured to align with question
B.1. That is, optical, SONET/SDH, MPLS, IP (if we agree to add this level).
Also, B.1 mentions "restoration," while this section talks about
"protection."

Optical protection/restoration standards defined by industry bodies should
be followed. These must cover, at a minimum, fault detection, signaling, and
reversion.

I think that we need to state requirements in terms of the SONET/SDH
"connection" systems that have protection in Linear or Ring configurations.
I'm not an expert in this area, but the following type of taxonomy might be
useful:
- 1:1, 1+1, 1:N Linear Line systems
- Unidirectional/Bidirectional Line/Path Switched Rings 2/4 wires (6
combinations)

The routers or Label Switching Routers should follow the SONET/SDH failure
indication, switchover, reversion, and monitoring requirements for the above
cases.

I'm not aware of a SONET/SDH M:N protection standard. Sometimes this term is
used to refer to mesh restoration, not protection.

I'm not familiar with P-cycles. Is there a reference that we can add to the
questionnaire?

I think that fast-restoration (or M:N protection) at the MPLS level is very
important.

>
> 3.  What are the requirements on local span (i.e., link by link)
>      protection and end-to-end protection, and the interaction
> between them?
>      E.g.: what should be the granularity of connections for
>      each type (single connection, bundle of connections, etc).

Here is more vague terminology. What is a "span?" Is this like a SONET
section or line? That is, is a span the "connection" between a pair of nodes
at the next level down in the transmission/multiplexing hierarchy? Also, we
need to define what an "end" point is in our context. Given my
interpretation of a connection in B.1, here is my answer.

Granularity
Optical: Transparent Wavelength or overall framed signal rate (e.g., OC-N)
SONET/SDH: Overall optical rate or a framed section of the payload (e.g.,
STS-M)
MPLS: Individual or set of LSPs.

Traditionally, restoration strategies have addressed interaction by having
successively higher multiplexing levels operate at a restoration time scale
greater than that of the next lower layer. This is not feasible if one
desires very rapid MPLS restoration. Therefore, a fundamental requirement is
to provide a means for a provider to control the operation of
restoration/protection between these levels in a manner that IS NOT
determined by such nested timers.

>
> C. Hierarchy
>
> 1. Vertical (between two network layers):
>     What are the requirements for the interaction between restoration
>     procedures across two network layers, when these features are
>     offered in both layers?
>     (Example, MPLS network realized over pt-to-pt
>     optical connections.) Under such a case,
>
>     (a) Are there any criteria to choose which layer should provide
>           protection?
>
Granularity, priority, and preemptibility are some important criteria.

>     (b) If both layers provide survivability features, what are the
>           requirements to coordinate these mechanisms?
>
A default coordination mechanism could be the use of nested timers, as has
been done traditionally for backward compatibility. However, there should be
configuration options to change the values of these timers and/or disable
protection/restoration at various levels. Ideally, there would be an
automatic means for higher levels to learn about what has been configured at
a lower level.
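
A minimal sketch of what configurable nested timers could look like. The
class name, field names, and millisecond values here are illustrative
assumptions, not taken from any standard or existing implementation. Each
layer has a hold-off time and an enable flag, and a check confirms that every
enabled layer waits longer than the enabled layer beneath it.

```python
from dataclasses import dataclass


@dataclass
class LayerRestorationConfig:
    layer: str
    enabled: bool = True
    hold_off_ms: int = 0  # how long this layer waits before acting on a fault


def validate_nesting(layers):
    """Check that each enabled layer waits longer than the enabled layer below it.

    `layers` is ordered from lowest (e.g. optical) to highest (e.g. MPLS).
    Returns a list of problem descriptions; an empty list means the nesting holds.
    """
    problems = []
    previous = None
    for cfg in layers:
        if not cfg.enabled:
            continue  # a disabled layer imposes no wait on the layers above it
        if previous is not None and cfg.hold_off_ms <= previous.hold_off_ms:
            problems.append(
                f"{cfg.layer} hold-off ({cfg.hold_off_ms} ms) must exceed "
                f"{previous.layer} hold-off ({previous.hold_off_ms} ms)"
            )
        previous = cfg
    return problems


# Hypothetical default: traditional nesting, lowest layer first.
layers = [
    LayerRestorationConfig("optical", hold_off_ms=0),
    LayerRestorationConfig("sonet", hold_off_ms=50),
    LayerRestorationConfig("mpls", hold_off_ms=100),
]
print(validate_nesting(layers))

# Provider override: disable SONET protection so MPLS can react sooner.
layers[1].enabled = False
layers[2].hold_off_ms = 10
print(validate_nesting(layers))
```

The second case is the point of the requirement: once the lower layer's
protection is disabled, the higher layer no longer needs to wait for it.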

>     (c) How is lack of current functionality of cross-layer
>           coordination currently hampering operations?
>
Currently, if SONET protection switching is used, MPLS recovery timers must
wait until SONET has had time to switch.

>     (d) Would the benefits be worth additional complexity associated
>           with routing isolation (e.g. VPN, areas), security, address
>           isolation and policy / authentication processes?

A driving benefit would be economic. I'm not aware of public studies that
quantify the economic benefit of hierarchical restoration for IP networks.
Economic benefits are driven not only by the theoretical efficiency of an
approach, and less by equipment and physical facility costs than by
operational costs. If the additional complexity made the set of
networks easier to operate, this would be a significant benefit.

>
>
> 2. Horizontal (between two areas or administrative subdivisions within
>     the same network layer):
>
>     (a) What are the criteria that trigger the creation of protocol or
>           administrative boundaries pertaining to restoration? (e.g.,
>           scalability?  multi-vendor interoperability? what are the
>           practical issues?)  multi-provider? Should multi-vendor
>           necessitate hierarchical separation?

Creating geographic "islands" of different vendor equipment has been done in
transmission networks for a long time because multi-vendor interoperability
has been difficult to achieve. "Islands" reduce the need for
interoperability and also make administration and operations less complex.
In an ideal world, interoperability is highly desirable; however, as a
practical matter, I would like to see support for an alternative to full
interoperability.

A provider should be able to concatenate protection/restoration mechanisms
in order to provide a "protected link" to the next higher level. Think of
SONET rings connecting to TDM DXCs with 1+1 line level restoration between
the ADM and the DXC port. The TDM connection, e.g., a DS3 is protected, but
usually all equipment on each SONET ring is from a single vendor. The DXC
cross connections are controlled by the provider and the ports are
physically protected resulting in a highly available design.

Traditionally, providers have to coordinate the equipment on either end of a
"connection," and making this interoperable reduces complexity. In the early
days of any protection/restoration technology, I fear that this practice
will likely continue.

>
>     When such boundaries are defined:
>
>     (b) What are the requirements on how protection/restoration is
>           performed end-to-end across such boundaries?

There should be a standard means to configure, discover, and/or coordinate
which level is performing a particular type of protection/restoration at
each boundary. A protocol that allowed this information to be collected and
assembled automatically would be quite valuable to operations.

>
>     (c) If different restoration mechanisms are implemented on two
>           sides of a boundary, what are the requirements on their
>           interaction?

I believe that there needs to be some restoration mechanism at the boundary
that is a least common denominator, for example, 1+1 port protection for a
highly-reliable service.

There should also be some standardized means for a protection/restoration
scheme on one side of such a boundary to communicate with the scheme on the
other side regarding the success or failure of the protection/restoration
action. For example, if a part of a "connection" is down on one side of such
a boundary, there is no need for the other side to restore failures.
>
>    What is the primary driver of horizontal hierarchy? (select one)
>     - functionality (e.g. metro -v- backbone)
>     - routing scalability
>     - signalling scalability
>     - current network architecture, trying to layer TE on top of
>       an already hierarchical network architecture
>     - routing and signalling

Functionality. Geographic islands reduce the need for interoperability.
Using a simpler, more interoperable, protection/restoration scheme at
metro/backbone boundaries is natural for many provider network
architectures.
>
>    For signalling scalability, is it
>     - manageability
>     - processing/state of network
>     - edge-to-edge N^2 type issue
I assumed that this was "select one" also. Suggest that we invite the
respondent to explain their answer, as I have tried to do.

Edge-to-edge N^2 type issue. For a large network, maintaining a "connection"
between every edge ("end"?) is simply not scalable and can result in very
inefficient designs if the "connection" has large discrete increments (e.g.,
lambda, OC-N). The processing for establishing each connection is not the
driving factor; rather, it is the re-signaling of connections impacted by a
failure that creates a tremendous spike in signaling traffic. If network
devices process signaling messages at a fixed maximum rate, then restoration
time is proportional to the number of connections. Having restoration time
grow as O(N^2) is highly undesirable.
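
A rough sketch of this scaling argument follows. The function names, the
one-message-per-connection model, and the rate of 1000 messages per second
are simplifying assumptions chosen for illustration. It also takes the worst
case, in which every connection of a full mesh is affected.

```python
def full_mesh_connections(edge_nodes: int) -> int:
    """Number of bidirectional connections in a full mesh of N edges: N(N-1)/2."""
    return edge_nodes * (edge_nodes - 1) // 2


def worst_case_restoration_time_s(connections: int, messages_per_second: float) -> float:
    """Time to re-signal every connection, assuming one signaling message per
    connection and a fixed maximum processing rate (simplifying assumptions)."""
    if messages_per_second <= 0:
        raise ValueError("messages_per_second must be positive")
    return connections / messages_per_second


for n in (10, 100, 1000):
    c = full_mesh_connections(n)
    t = worst_case_restoration_time_s(c, messages_per_second=1000.0)
    print(f"N={n:5d}  connections={c:7d}  restoration time={t:9.3f} s")
```

Under these assumptions, multiplying the number of edges by ten multiplies
the restoration time by roughly one hundred.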

>
>     For routing scalability, is it
>     - processing/state of network
>     - are you flat and want to go hierarchical
>     - or already hierarchical?
>     - data or TDM application?

Processing/state of network. If every node at a certain
protection/restoration level must communicate and process the state of every
other node, similar arguments to that of signaling apply in terms of
scalability. Furthermore, although aggregating routing information may be
suboptimal, it may not be that much of a penalty if protocol mechanisms are
available to expose the important details across "horizontal" network
regions.

>
> D. Policy
>
> 1. What are the requirements for policy support during
> protection/restoration,
>     e.g., restoration priority, preemption, etc.
>
It is desirable to have a restoration and preemption priority. Ideally,
restoration priority should determine the order in which connections are
restored. Preemption priority should only be used in the event that all
connections cannot be restored, in which case connections with lower
preemption priority should be released. Types of connections that are
preemptible are those used by providers as donated research networks, test
networks, or connections allocated to networks that have other means of
restoration. Restoration priority would be useful if it results in a
restoration time on the order of current SONET/SDH protection systems for
the highest priority connections.
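
One way to read the two priorities is sketched below. The Connection fields,
the "higher value means more important" convention, and the single
spare-capacity figure are all assumptions made for illustration. Preemption
is applied only when the failed connections cannot all fit, and whatever
remains is restored in restoration-priority order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Connection:
    name: str
    bandwidth: int             # capacity units needed on the backup path
    restoration_priority: int  # higher value = restored earlier (assumed convention)
    preemption_priority: int   # higher value = released later (assumed convention)


def plan_restoration(failed, spare_capacity):
    """Return (restore_order, released) for a list of failed connections.

    Connections are released, lowest preemption priority first, only while the
    remaining ones do not fit into spare_capacity. The survivors are then
    ordered by restoration priority, highest first.
    """
    keep = list(failed)
    released = []
    for conn in sorted(failed, key=lambda c: c.preemption_priority):
        if sum(c.bandwidth for c in keep) <= spare_capacity:
            break  # everything that is left can be restored
        keep.remove(conn)
        released.append(conn)
    restore_order = sorted(keep, key=lambda c: -c.restoration_priority)
    return restore_order, released


# Hypothetical failure: three connections of 4 units each, 8 units spare.
failed = [
    Connection("research-net", bandwidth=4, restoration_priority=1, preemption_priority=0),
    Connection("premium-vpn", bandwidth=4, restoration_priority=9, preemption_priority=9),
    Connection("best-effort", bandwidth=4, restoration_priority=3, preemption_priority=5),
]
order, released = plan_restoration(failed, spare_capacity=8)
print("restore in order:", [c.name for c in order])
print("released:", [c.name for c in released])
```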

> E. Signaling Mechanisms
>
> 1. What are the requirements on the signaling transport mechanism
>    (e.g., in-band over sonet/sdh overhead bytes, out-of-band over
>    an IP network, etc.) used to communicate restoration protocol
>    messages between network elements. What are the bandwidth and
>    other requirements on the signaling channels?

Using a mechanism that is inherently part of a connection (e.g., overhead
bytes in SONET/SDH) is desirable in that it eliminates the potential for
misconfiguration that exists in out-of-band signaling systems. The capacity
of the signaling channel should be sized for the worst-case restoration
scenario that the mechanism is specified to support. For example, in a
network with a maximum of C restorable connections traversing a "span," with
a restoration time objective of T seconds, the signaling rate should be C/T
restored connections per second.
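
A minimal sketch of this sizing rule. The function name and the example
figures (2000 connections, 0.5 seconds) are illustrative assumptions, not
values from any standard.

```python
def required_signaling_rate(max_connections: int, restoration_time_s: float) -> float:
    """Return the minimum signaling rate, in restored connections per second,
    needed to restore max_connections (C) within restoration_time_s (T)."""
    if max_connections < 0:
        raise ValueError("max_connections must be non-negative")
    if restoration_time_s <= 0:
        raise ValueError("restoration_time_s must be positive")
    return max_connections / restoration_time_s


# Hypothetical span: C = 2000 restorable connections, T = 0.5 s objective.
rate = required_signaling_rate(2000, 0.5)
print(f"{rate:.0f} restored connections per second")
```

Converting this rate to a channel bandwidth would also need the message size
and the number of messages per restored connection, neither of which is
assumed here.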

>
> 2. What are the requirements on fault detection/localization mechanisms
>    (which is the prelude to performing restoration procedures)
>    in the case of opaque and transparent optical networks?

In a time-nested restoration paradigm, these become very small for an opaque
network when compared with SONET/SDH detection intervals, which, I believe,
are not fully interoperable at this low level. In any event, these probably
need to be as fast as possible.

>    What are the requirements in the case of MPLS restoration?

In my opinion, if fast MPLS restoration could get down to several hundred
milliseconds, this would be close enough to meet the vast majority of the
application requirements that currently specify 50 ms SONET/SDH restoration
times.

>
> 3. What are the requirements on signaling protocols to be used in
>    restoration procedures (e.g., high priority processing, security, etc).
>
Since the restoration/protection of connections provides the basic
connectivity upon which IP routing must run, I think that it should have a
higher priority. In the event of a massive connection failure followed
shortly thereafter by an IP route flapping incident, which should the route
processor work on first? If intermediate connection restoration action
processing were interleaved with route updates, the route flapping effect
would likely be exacerbated or extended.

> 4. Are there any requirements on the operation of restoration protocols?

As mentioned earlier, they must be configurable. They must also be capable
of being monitored. For example, there should be some indication of
connections that a network level (or a hierarchical subdivision thereof)
fails to restore. There should also be some means to control reversion.
Automatic reversion to a more optimal state is desirable, but there should
be a way to disable it for specific connections. For example, if maintenance
is occurring on nodes and/or facilities through which a connection passes,
operations personnel need some way to keep the traffic off of the facilities
that people are working on.
>
> E. Quantitative
>
> 1. What are the quantitative requirements (e.g., latency) for completing
>    restoration under different protection modes (for both local and
>    end-to-end protection)?
>
I don't think "latency" is a good term to use here. Is "recovery time"
better?

I think that the SONET/SDH 50 ms protection switching is a benchmark that
each level must either provide an option for, or else come close to it
(e.g., 100s of ms for MPLS).

Assuming that restoration is being asked about here, the next time scale is
the 2+/-0.5 seconds when TDM equipment generates AIS and TDM devices start
taking restoration action. So, schemes that can recover the next level of
restoration priority in a few seconds have use. This is also a time scale
where most applications will not time out.

After a few seconds, lower priority restoration that could take a few
minutes may be of use. This could fit the Internet paradigm where if
something doesn't work (or stops working) but works again when tried again
in a few minutes, then a customer is not likely to complain.

Finally, there are the lowest restoration priority connections (or those
that are preempted), which may not be restored for hours (usually after the
failed node, line, or software is repaired). Presumably, the charge for this
level of service (or lack thereof) is substantially less than for a service
with faster restoration characteristics and thereby higher availability.

> F. Management
>
> 1. What information should be measured/maintained by the control plane at
>     each network element pertaining to restoration events?
>
Per connection
Active state: ideally this should be verified periodically somehow
Failed state: detected fault and type
Restoration status: Primary/backup, trying to restore, preempted

Per node:
Priority/preemption level as coordinated with other nodes in (sub)network
Restoration time by priority
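
The list above could be held in records along the following lines. All
class, field, and method names are hypothetical and chosen only to mirror
the items listed. This is a sketch of one possible layout, not a proposed
MIB or data model.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional


class RestorationStatus(Enum):
    ON_PRIMARY = "primary"
    ON_BACKUP = "backup"
    RESTORING = "trying to restore"
    PREEMPTED = "preempted"


@dataclass
class ConnectionRecord:
    connection_id: str
    active: bool = True
    last_verified_s: Optional[float] = None  # time of the last periodic check, if any
    fault_type: Optional[str] = None         # set only once a fault has been detected
    status: RestorationStatus = RestorationStatus.ON_PRIMARY


@dataclass
class NodeRecord:
    node_id: str
    priority_levels: int  # priority/preemption levels agreed with other nodes
    restoration_time_ms: Dict[int, float] = field(default_factory=dict)  # keyed by priority
    connections: Dict[str, ConnectionRecord] = field(default_factory=dict)

    def record_fault(self, connection_id: str, fault_type: str) -> None:
        """Mark a connection as failed and awaiting restoration."""
        rec = self.connections[connection_id]
        rec.active = False
        rec.fault_type = fault_type
        rec.status = RestorationStatus.RESTORING

    def record_restored(self, connection_id: str, priority: int, elapsed_ms: float) -> None:
        """Mark a connection as restored on its backup and note the elapsed time."""
        rec = self.connections[connection_id]
        rec.active = True
        rec.status = RestorationStatus.ON_BACKUP
        self.restoration_time_ms[priority] = elapsed_ms


node = NodeRecord(node_id="node-a", priority_levels=4)
node.connections["conn-1"] = ConnectionRecord("conn-1")
node.record_fault("conn-1", fault_type="loss of signal")
node.record_restored("conn-1", priority=1, elapsed_ms=180.0)
print(node.connections["conn-1"].status.value, node.restoration_time_ms)
```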

> 2. What are the requirements for the correlation between control plane
>     and data plane failures from the restoration point of view?
>
As stated earlier, some mechanism that prevents (or at least detects)
configuration errors regarding which instance or identifier of a control
plane protocol is associated with a specific data plane is a fundamental
requirement.