
RE: Questions on RSVP-TE Graceful Restart and the new Extensions

To: "Adrian Farrel" <adrian@olddog.co.uk>, "Ccamp (E-mail)" <ccamp@ops.ietf.org>
Subject: RE: Questions on RSVP-TE Graceful Restart and the new Extensions
From: "Bardalai, Snigdho" <Snigdho.Bardalai@us.fujitsu.com>
Date: Mon, 8 Oct 2007 16:29:15 -0500
In-reply-to: <046401c80809$dfb94cd0$5102010a@your029b8cecfe>

Hi Adrian,

My comments below...

Thanks,
Snigdho

-----Original Message-----
From: owner-ccamp@ops.ietf.org [mailto:owner-ccamp@ops.ietf.org]On
Behalf Of Adrian Farrel
Sent: Saturday, October 06, 2007 6:12 AM
To: Bardalai, Snigdho; Ccamp (E-mail)
Subject: Re: Questions on RSVP-TE Graceful Restart and the new
Extensions


Hi Snigdho,

Always good to have reports of use of the more advanced functions.

Some thoughts...

We can do some work to fix up the SRefresh behavior, but I don't think it 
actually helps, because if Summary Refresh is not being used, exactly the 
same scenario can arise with a Path Refresh. In other words, fixing the 
SRefresh would not fix the problem.

[SCB] We had thought about such an option but as you mention the solution would 
      have to consider the full refresh scenario as well.

In some circumstances, the window in the scenario you have drawn is very 
small. The SRefresh *and* the Path Refresh must be sent by N1 between N2 
completing restart and N2's first Hello arriving at N1. One could say that 
N2 should ignore all received messages until it has Hellos up and running. 
That would guarantee that N1 knew about the restart.

That simply requires that when startup completes N2 must:
- either
    - immediately send a new Hello to each neighbor
  or
     - respond to any first received message from a
       neighbor with which it does not have an active
       Hello exchange by sending a Hello
- ignore all subsequent RSVP messages except Hellos
  from neighbors with which it does not have an active
  Hello exchange
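
The rule above can be sketched as a small message filter. This is only an
illustration under assumptions: the class RestartedNode, its method names and
the string message types are hypothetical, and an "active Hello exchange" is
modelled simply as "a Hello has been received from that neighbor since
startup". Both variants of the "either/or" step are shown.

```python
class RestartedNode:
    """Hypothetical model of N2 after its restart has completed."""

    def __init__(self):
        self.hello_active = set()  # neighbors with an active Hello exchange
        self.hello_sent = set()    # neighbors we have already sent a Hello to
        self.outbox = []           # (neighbor, message type) pairs to transmit

    def _send_hello(self, nbr):
        self.outbox.append((nbr, "Hello"))
        self.hello_sent.add(nbr)

    def on_startup_complete(self, neighbors):
        # Variant 1: immediately send a new Hello to each known neighbor.
        for nbr in neighbors:
            self._send_hello(nbr)

    def on_message(self, nbr, msg_type):
        """Return True if the message is processed, False if it is ignored."""
        if msg_type == "Hello":
            # Assumption: receiving a Hello makes the exchange active.
            self.hello_active.add(nbr)
            return True
        if nbr in self.hello_active:
            return True
        # Variant 2: answer the first message from this neighbor with a Hello.
        if nbr not in self.hello_sent:
            self._send_hello(nbr)
        # Ignore everything except Hellos until the exchange is active.
        return False
```

With this filter, the SRefresh and Path messages in the reported scenario
would be dropped by N2 rather than answered with a NACK or PathErr, and N1
would receive a Hello that reveals the restart.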

[SCB] This is a good and robust solution. Addressing this type of implementation detail
      would be a good candidate for the GR-description draft.

We should understand why N2 sends PathErr. "Resources in use" could either 
mean that *all* resources are already in use, or that the limited resources 
required by the Path message (i.e. Upstream Label or Label Set) are already 
in use.

In all cases, the error code should indicate that the requested new resource 
allocation cannot be satisfied (N2 thinks this is a new LSP).

In the case of no resource being available, the PathErr MUST [3473] contain 
"Routing problem/MPLS label allocation failure".
In the case of failure because the Label Set cannot be satisfied, the 
PathErr MUST [3473] carry "Routing problem/Label Set". In the case of 
failure because of Upstream Label N2 MUST [3473] send a PathErr with 
"Routing problem/Unacceptable label value" and MAY include an Acceptable 
Label Set object.
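
As a rough illustration, the three cases above can be written as a lookup
from failure cause to the PathErr contents. The names Failure and
path_err_for are hypothetical, and the error code and value are given as the
text strings quoted above rather than as numeric values.

```python
from enum import Enum


class Failure(Enum):
    """Hypothetical labels for the three failure causes discussed above."""
    NO_RESOURCE = "no resource available"
    LABEL_SET = "Label Set cannot be satisfied"
    UPSTREAM_LABEL = "Upstream Label cannot be used"


def path_err_for(failure, acceptable_labels=None):
    """Return (error code, error value, optional Acceptable Label Set)."""
    if failure is Failure.NO_RESOURCE:
        return ("Routing problem", "MPLS label allocation failure", None)
    if failure is Failure.LABEL_SET:
        return ("Routing problem", "Label Set", None)
    if failure is Failure.UPSTREAM_LABEL:
        # The Acceptable Label Set is optional (MAY), so it may be None.
        return ("Routing problem", "Unacceptable label value", acceptable_labels)
    raise ValueError(f"unknown failure cause: {failure!r}")
```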

So, I would suggest that N1 may be over-reacting to the PathErr from N2. 
Presumably N1 expects that the LSP is up and running - it already has Resv 
state and data is probably flowing. So any of these three error codes (all 
of which represent LSP setup errors) should cause N1 to suspect something 
slightly strange is happening. Before getting too agitated and taking the 
dramatic step of tearing the LSP, it should just check that everything else 
is functioning correctly, and part of that process would be to send a Hello 
if one has not been sent/received for a considerable period of time.

In fact, your situation arises either because the Hello period is set far too 
large (i.e. larger than the neighbor's whole restart cycle) or because N1 is 
not considering any difference between normal and hello-degraded states. The 
former is a configuration error. The latter should allow the PathErr to be 
treated with more care than a simple PathTear.
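
One way to picture the more careful handling suggested above is the sketch
below. It is not taken from any RFC or implementation: the function name, the
action strings and the hello_stale_after_s threshold (standing in for "a
considerable period of time") are all assumptions made for illustration.

```python
# The three LSP setup error values discussed above.
SETUP_ERROR_VALUES = {
    "MPLS label allocation failure",
    "Label Set",
    "Unacceptable label value",
}


def react_to_path_err(error_value, has_resv_state,
                      seconds_since_last_hello, hello_stale_after_s):
    """Return the list of actions N1 takes on receiving a PathErr."""
    suspicious = error_value in SETUP_ERROR_VALUES and has_resv_state
    if not suspicious:
        return ["normal PathErr handling"]
    # A setup error for an LSP that is already up is strange: check first.
    actions = ["verify neighbor and LSP state"]
    if seconds_since_last_hello >= hello_stale_after_s:
        # No recent Hello, so probe the neighbor before tearing anything down.
        actions.append("send Hello")
    return actions
```

In the reported scenario N1 already holds Resv state and has not exchanged a
Hello recently, so it would probe N2 with a Hello instead of sending PathTear
straight away.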

[SCB] I believe what we are getting into is the clarification of the detailed
      behaviors to cover the various scenarios. I agree that the GR-description
      ID would be the ideal place to clarify these.

With regard to your second question. Yes, we believe that all cases of 
multiple failures are handled by the restart procedures without refresh 
failure causing state to be deleted. However, the text for this was removed 
from the graceful restart draft before publication (actually, long ago) as 
second order failures really clogged up the draft. Instead, we have 
draft-ietf-ccamp-gr-description-00.txt.

I suggest:

1. The situation you describe in your first scenario (with and without 
SRefresh) should be included in the GR-Description I-D.

2. You should check the GR-Description I-D to see whether it answers your 
questions about multiple failures.

In both cases, I am sure the authors would welcome suggested text and 
pointers to what they could change.

[SCB] Sure, I will provide some text.

Cheers,
Adrian
----- Original Message ----- 
From: "Bardalai, Snigdho" <Snigdho.Bardalai@us.fujitsu.com>
To: "Ccamp (E-mail)" <ccamp@ops.ietf.org>
Sent: Friday, October 05, 2007 5:18 PM
Subject: Questions on RSVP-TE Graceful Restart and the new Extensions


Hi,

I have a couple of questions on RSVP-TE Graceful Restart and the new 
extensions being proposed in draft-ietf-ccamp-rsvp-restart-ext-09.

Did anybody come across any issues when the hello interval duration times 
the failure multiple (typically 3) is too large compared to the neighboring 
node restart duration? For example, if the RSVP-TE hello interval is 10 
seconds, the multiple is 3 (a 30-second detection window) and the 
neighboring node restarts within 10 seconds, then it is possible that the 
RSVP-TE hello will never detect a hello failure.
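
The arithmetic in this example can be checked with a few lines. The function
name is hypothetical, and the model is deliberately simple: it assumes a hello
failure is declared only after a silent period of interval times multiple, and
it ignores where in the hello cycle the restart begins.

```python
def hello_failure_detected(hello_interval_s, failure_multiple, restart_duration_s):
    """True if the neighbor stays silent long enough to trigger a hello failure."""
    detection_window_s = hello_interval_s * failure_multiple
    return restart_duration_s >= detection_window_s


# The example above: 10 s interval, multiple of 3, restart completes in 10 s.
print(hello_failure_detected(10, 3, 10))  # False: the 30 s window never elapses
```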

RFC3473 does describe detection of a node restart in this case based on a 
new source instance in the hello message, but we have come across an issue 
with NACKs being generated for an SRefresh message in this scenario.
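
For reference, the instance-based detection mentioned here can be sketched as
follows. The class and method names are hypothetical, and only the source
instance comparison is modelled. The point relevant to this scenario is that
the check can only run when a Hello from the restarted neighbor actually
arrives.

```python
class HelloNeighbor:
    """Hypothetical per-neighbor record of the last source instance seen."""

    def __init__(self):
        self.last_src_instance = None

    def on_hello(self, src_instance):
        """Return True if this Hello shows that the neighbor has restarted."""
        restarted = (self.last_src_instance is not None
                     and src_instance != self.last_src_instance)
        self.last_src_instance = src_instance
        return restarted
```

Until N2 sends its first Hello after restarting, on_hello is never called on
N1, so N1 has no indication that a restart took place.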

Please look at the sequence diagram below:

  N1                                N2
  |                                 |
  |                                 X (Restart start)
  |  HELLO                          |
  |-------------------------------->|
  |                                 |
  |  SRefresh                       |
  |-------------------------------->|
  |                                 |
  |  HELLO                          |
  |-------------------------------->|
  |                                 |
  |                                 X (Restart complete)
  |  SRefresh                       |
  |-------------------------------->|
  |  NACK                           |
  |<--------------------------------|
  |  Path (without recovery label)  |
  |-------------------------------->|
  |                                 X (resource allocation failed
  |                                 |  because the resources are in use)
  |  PathErr                        |
  |<--------------------------------|
  |  PathTear                       |
  |-------------------------------->|
  X (CON deletion)                  X (XCON deletion)
  |                                 |

The issue is that, because N1 did not detect a hello failure, it continues 
sending SRefreshes, which may get NACKed by N2 once restart completes because 
there is no Path state corresponding to the SRefresh message. This NACK 
causes a Path refresh message to be generated, but there is no 
RECOVERY_LABEL because N1 has not yet detected that N2 has restarted, since 
hello exchanges have not yet resumed. PLEASE NOTE: This is based on an actual 
implementation and a real test.

What is the solution to this issue? I don't see either N1 or N2 doing 
anything that is not compliant with the current RFCs. Or is there 
something I have missed?

The other issue I wanted to understand is with respect to the graceful 
restart extension. Will the RecoveryPath message handle issues when 
communication fails and a node restarts? There may be issues when some 
nodes in the LSP path get isolated from both the upstream and downstream ends.

Example,

             A---B-x...x-C---D---E-x...x-F---G

Nodes C, D and E are isolated. If this condition persists and nodes C, D and 
E restart, will the LSP get deleted after the recovery timer expires in 
node D? Can this be prevented?

Would appreciate your response.

Regards,
Snigdho
