
Re: More confirmed-commit issues



Rob Enns wrote:

Hi, comments below.

ditto



-----Original Message-----
From: Andy Bierman [mailto:ietf@andybierman.com]
Sent: Saturday, May 14, 2005 5:13 AM
To: Rob Enns
Cc: netconf
Subject: Re: More confirmed-commit issues


Rob Enns wrote:



How does this replacement text sound?

----
8.4  Confirmed Commit Capability

8.4.1  Description

 The #confirmed-commit capability indicates that the server will
 support the <confirmed> and <confirm-timeout> parameters for the
 <commit> protocol operation.  See Section 8.3 for further
 details on the <commit> operation.

 A confirmed commit operation MUST be reverted if a follow-up commit
 (called the "confirming commit") is not issued within 600 seconds (10
 minutes).  The timeout period can be adjusted with the
 <confirm-timeout> element.  The confirming commit can itself include a
 <confirmed> parameter.




This last sentence is confusing to me. It makes sense if the <candidate> contains
new changes and the 2nd confirmed commit starts a new "revert timeout" for these
new changes.



That's the intent. I mention it here only to indicate that the
confirming commit is not magic, it's a regular commit that could
itself be confirmed or make additional changes.


ok



I really don't like the possible side effects from this confirmed commit, especially with our
shared <candidate> and global locking. If you don't maintain the session and hold the
lock throughout the entire double commit, really bad things can happen.


(NEW ISSUE: What happens to a confirmed commit in progress if the session is lost
or the agent reboots?)



To confirm the above issues:

If the session doing the confirmed commit is lost, the confirmed commit continues.

If the agent reboots in the middle of a confirmed commit, I assume the box boots
with the new config, so an agent reboot acts like a 2nd commit. Yuch. Or does
the agent remember that a revert timeout was pending? If the timer doesn't
survive, and the first commit CAUSED the reboot, isn't this device in an
endless reboot loop? If the crash happens in the startup sequence, before the
timer can pop, it's in an endless reboot loop anyway.


T0    - boot with baseline config
Tc    - Manager A issues a confirmed commit, w/ revert to baseline at Tc+i
Tc+1  - Manager A loses its connection and session
Tc+10 - Manager B has no idea Manager A did this, comes along, gets the lock,
        and starts writing to the <candidate> config, which starts with the
        contents of <running> at time Tc
Tc+20 - Then Manager A comes back and can't get a lock
Tc+i  - Manager A's revert timer pops before Manager B is done.
        The agent reverts the state of <running> to T0. (But B thinks the
        state of <running> is Tc).


At this point, it depends on the difference between config T0 and Tc, and
what Manager B is doing, as to whether benign or devastating effects will follow.
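
The timeline above can be played out in a few lines of Python. This is
only a toy model, not NETCONF: the Agent class, its method names, the
mtu/hostname values and the value chosen for Tc are all made up for
illustration. It also assumes that the revert leaves <candidate>
untouched and that the pending revert survives the loss of A's session;
the draft text does not spell out the former.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Agent:
    """Toy device with <running>, a shared <candidate>, and one global lock."""
    running: dict
    candidate: dict = field(default_factory=dict)
    lock_owner: Optional[str] = None
    revert_to: Optional[dict] = None  # config to restore when the timer pops
    revert_at: Optional[int] = None   # time at which the revert timer pops

    def __post_init__(self):
        # <candidate> starts as a copy of <running>.
        self.candidate = dict(self.running)

    def lock(self, who: str) -> bool:
        if self.lock_owner is None:
            self.lock_owner = who
            return True
        return False

    def drop_session(self, who: str) -> None:
        # Losing the session releases the lock, but (as discussed in the
        # thread) the pending revert keeps running.
        if self.lock_owner == who:
            self.lock_owner = None

    def confirmed_commit(self, now: int, timeout: int) -> None:
        self.revert_to = dict(self.running)
        self.revert_at = now + timeout
        self.running = dict(self.candidate)

    def tick(self, now: int) -> bool:
        """Pop the revert timer if it is due. Returns True if a revert happened."""
        if self.revert_at is not None and now >= self.revert_at:
            self.running = self.revert_to
            self.revert_to = self.revert_at = None
            # Assumption: <candidate> is left as-is by the revert.
            return True
        return False


def run_scenario():
    agent = Agent(running={"mtu": 1500})          # T0: baseline config
    tc, i = 100, 600                              # hypothetical Tc; 600 s timeout

    assert agent.lock("A")
    agent.candidate["mtu"] = 9000
    agent.confirmed_commit(now=tc, timeout=i)     # Tc
    agent.drop_session("A")                       # Tc+1

    assert agent.lock("B")                        # Tc+10
    b_view = dict(agent.running)                  # what B believes <running> is
    agent.candidate["hostname"] = "edge1"         # B edits on top of A's change

    a_relocked = agent.lock("A")                  # Tc+20: fails, B holds the lock
    reverted = agent.tick(now=tc + i)             # Tc+i: timer pops under B
    return agent, b_view, a_relocked, reverted
```

After run_scenario(), <running> is back at the T0 baseline while B's
recorded view still contains A's change, and in this model <candidate>
still carries A's change plus B's edit. B's next commit would therefore
silently reapply A's unconfirmed change.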


It's never a good thing to design this much "astonishment" into routing products.
At a minimum, we need to document what happens in as many corner cases as
we can think of, but we should also try to respect the principle of least astonishment.



I don't view this as astonishing, or a side effect. It's very
simple: when the timer pops from a confirmed commit, the device
will revert to the T0 configuration. I'd argue that it's the kind
of easy to understand basic behavior that operators like.


It is astonishing to Mgr B who has no way of knowing a revert
timeout is pending. To me, the whole thing is just fragile. IMO, a
configuration protocol should be robust, not fragile.


A protocol that can allow the possibility of severely detrimental
config changes (through unintended or malicious acts), by merely
dropping a connection, is fragile.

It's possible the security AD could have a problem with this too,
during the IESG review.




 If a confirming commit is not issued, the device will revert it's
 configuration to the state prior to the issuance of the confirmed
 commit.  Note that any commit operation, including a commit which
 introduces additional changes to the configuration, will serve as a
 confirming commit.  Thus to cancel a confirmed commit and revert
 changes without waiting for the confirm timeout to expire, the
 confirming commit can explicitly restore the configuration to it's
 state before the confirmed commit was issued.




I don't understand this last sentence, and this revert operation at all.



This is in response to your comment about how the configuration would
be reverted before the timer pops. We can't use the rollback operation
to explain it, because netconf doesn't have one at this point. I could
remove that text completely if it's confusing.


The fact that you have a rollback operation in Junoscript doesn't really
apply to this document. The sentence doesn't convey the idea that the
confirmed commit can be canceled through proprietary mechanisms, outside
the scope of the standard.


IMO, we need to remove this sentence since there is no rollback operation in netconf.
In fact, we should say instead that netconf provides no mechanism to force the
agent to cancel the confirmed commit and revert the <running> configuration.
The manager has to wait for the timeout interval to pass.


BTW, the phrase "the confirming commit can explicitly restore the configuration"
doesn't really make sense. s/confirming commit/manager/ and it does.
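
To make the s/confirming commit/manager/ reading concrete: with no
rollback operation, the only in-protocol way to "cancel" is for the
manager to keep its own copy of the old configuration and push it back
with an ordinary commit, which then counts as the confirming commit.
The sketch below is a toy model under that reading; ToyAgent and its
methods are hypothetical names, not a NETCONF API.

```python
import copy


class ToyAgent:
    """Hypothetical agent: any commit confirms a pending confirmed commit."""

    def __init__(self, running):
        self.running = copy.deepcopy(running)
        self.candidate = copy.deepcopy(running)
        self.saved = None  # config to restore on timeout; None = nothing pending

    def get_running(self):
        return copy.deepcopy(self.running)

    def edit_candidate(self, new_config):
        self.candidate = copy.deepcopy(new_config)

    def commit(self, confirmed=False):
        previous = self.running
        self.running = copy.deepcopy(self.candidate)
        # A plain commit clears any pending revert; a confirmed commit
        # starts a new one that would restore the pre-commit state.
        self.saved = previous if confirmed else None

    def timer_expired(self):
        if self.saved is not None:
            self.running = self.saved
            self.saved = None


def cancel_by_restoring(agent, baseline):
    """Manager-side 'cancel': write the old config back and commit it."""
    agent.edit_candidate(baseline)
    agent.commit()  # ordinary commit, serves as the confirming commit


def demo():
    agent = ToyAgent({"mtu": 1500})
    baseline = agent.get_running()       # the manager must save this itself
    agent.edit_candidate({"mtu": 9000})
    agent.commit(confirmed=True)         # confirmed commit, timer running
    cancel_by_restoring(agent, baseline)
    agent.timer_expired()                # nothing pending, so no effect
    return agent.running, agent.saved
```

Note that the restore is the manager's work, not something the agent
does on request: if the manager never saved the baseline, it has nothing
to push back and can only wait for the timeout.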




BTW, s/it's/its/ in both paragraphs above.



Quite right, thanks.

Rob



Andy


--
to unsubscribe send a message to netconf-request@ops.ietf.org with
the word 'unsubscribe' in a single line as the message text body.
archive: <http://ops.ietf.org/lists/netconf/>