RE: More confirmed-commit issues

To: "Andy Bierman" <ietf@andybierman.com>
Subject: RE: More confirmed-commit issues
From: "Rob Enns" <rpe@juniper.net>
Date: Wed, 18 May 2005 12:33:53 -0700
Cc: "netconf" <netconf@ops.ietf.org>
> >>  (NEW ISSUE: What happens to a confirmed commit in progress if the
> >>   session is lost or the agent reboots?)
> 
> To confirm the above issues:
> 
> If the session doing the confirmed commit is lost, the confirmed
> commit continues.
>
> If the agent reboots in the middle of a confirmed commit, I assume the
> box boots with the new config, so an agent reboot acts like a 2nd
> commit.  Yuch.  Or does the agent remember that a revert timeout was
> pending?  If the timer doesn't survive, and the first commit CAUSED
> the reboot, isn't this device in an endless reboot loop?  If the crash
> happens in the startup sequence, before the timer can pop, it's in an
> endless reboot loop anyway.

I think there are 2 cases here:
1) intentional reboot
-> if an operator intentionally reboots the box in the
middle of the confirmed commit, I'd say that effectively
confirms the commit, and we should explicitly mention 
this case in the protocol spec.

2) unintentional reboot (aka bug)
-> we can't standardize what netconf does in this case, right?



> >> T0    - boot with baseline config
> >> Tc    - Manager A issues a confirmed commit, w/ revert to baseline
> >>         at Tc+i
> >> Tc+1  - Manager A loses its connection and session
> >> Tc+10 - Manager B has no idea Manager A did this, comes along, gets
> >>         the lock, and starts writing to the <candidate> config,
> >>         which starts with the contents of <running> at time Tc
> >> Tc+20 - Then Manager A comes back and can't get a lock
> >> Tc+i  - Manager A's revert timer pops before Manager B is done.
> >>         The agent reverts the state of <running> to T0.  (But B
> >>         thinks the state of <running> is Tc).
> >>
> >> At this point, it depends on the difference between config T0 and
> >> Tc, and what Manager B is doing, as to whether benign or devastating
> >> effects will follow.
> >>
> >> It's never a good thing to design this much "astonishment" into
> >> routing products.  At a minimum, we need to document what happens in
> >> as many corner cases as we can think of, but we should also try to
> >> respect the principle of least astonishment.
> >>
> >
> >I don't view this as astonishing, or a side effect. It's very
> >simple: when the timer pops from a confirmed commit, the device
> >will revert to the T0 configuration. I'd argue that it's the kind
> >of easy to understand basic behavior that operators like.
> >  
> >
> It is astonishing to Mgr B who has no way of knowing a revert
> timeout is pending. To me, the whole thing is just fragile. 
> IMO, a configuration protocol should be robust, not fragile.

I don't think it's fragile. The problem in this scenario is that
Mgr A and Mgr B don't know what the other is doing. We can't
standardize a way out of badly managed networks.
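
For reference, the quoted T0/Tc timeline can be replayed with a toy
model. This is only a sketch under stated assumptions: ToyAgent, its
methods, the "mtu"/"ntp" settings and the 30-tick timeout are all
hypothetical, locking is left out, and the model assumes the revert
leaves <candidate> untouched. It is not how any real agent is
implemented.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyAgent:
    """Hypothetical in-memory stand-in for an agent (no locking, no sessions)."""
    running: dict
    candidate: dict = field(default_factory=dict)
    revert_to: Optional[dict] = None   # config to restore if never confirmed
    revert_at: Optional[int] = None    # time at which the revert timer pops

    def __post_init__(self):
        # <candidate> starts out as a copy of <running>.
        self.candidate = dict(self.running)

    def confirmed_commit(self, now: int, timeout: int) -> None:
        # Remember the pre-commit config, then activate <candidate>.
        self.revert_to = dict(self.running)
        self.revert_at = now + timeout
        self.running = dict(self.candidate)

    def commit(self) -> None:
        # Any commit serves as the confirming commit: cancel the timer.
        self.revert_to = None
        self.revert_at = None
        self.running = dict(self.candidate)

    def tick(self, now: int) -> None:
        # Revert <running> if the timer has popped without a confirmation.
        # Assumption: <candidate> is left as-is by the revert.
        if self.revert_at is not None and now >= self.revert_at:
            self.running = self.revert_to
            self.revert_to = None
            self.revert_at = None


# T0: boot with the baseline config.
agent = ToyAgent(running={"mtu": 1500})

# Tc (t=0): Manager A changes <candidate> and issues a confirmed commit.
agent.candidate["mtu"] = 9000
agent.confirmed_commit(now=0, timeout=30)

# Tc+1: Manager A's session is lost; the confirmed commit continues.

# Tc+10: Manager B reads <running> and starts editing <candidate>.
b_view_of_running = dict(agent.running)
agent.candidate["ntp"] = "192.0.2.1"

# Tc+i (t=30): the revert timer pops before Manager B is done.
agent.tick(30)

print("running after revert:", agent.running)
print("what B believes:     ", b_view_of_running)
```

After the timer pops, <running> is back at the T0 contents while
Manager B's picture of it is still the Tc contents, which is the
mismatch the quoted scenario describes.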

> A protocol that can allow the possibility of severely detrimental
> config changes (through unintended or malicious acts), by merely
> dropping a connection, is fragile.

Why would reverting to the T0 configuration, which is both what Mgr A
wanted to do, and what the device was running before, be severely
detrimental?

We can sit around making up corner cases where Mgr A and Mgr B don't
know what the other is doing; that's easy to do and not very
productive. There's no way the device ends up with a sane configuration
at the end of the day using _any_ configuration method if the entities
doing the configuration aren't coordinated.

The question is: what's the risk/reward of standardizing a feature
like confirmed commit? The risk is that operators who aren't aware
that a confirmed commit is underway could lose changes. The reward is
that we have a standardized way to protect against devices falling
off the network due to a change. IMO there is a very clear benefit which
outweighs the risk, and the risk is explicitly identified in the
protocol document.

> It's possible the security AD could have a problem with this too,
> during the IESG review.
>
> >>>  If a confirming commit is not issued, the device will revert its
> >>>  configuration to the state prior to the issuance of the confirmed
> >>>  commit.  Note that any commit operation, including a commit which
> >>>  introduces additional changes to the configuration, will serve as a
> >>>  confirming commit.  Thus to cancel a confirmed commit and revert
> >>>  changes without waiting for the confirm timeout to expire, the
> >>>  confirming commit can explicitly restore the configuration to its
> >>>  state before the confirmed commit was issued.
> >>>
> >>I don't understand this last sentence, and this revert operation at all.
> >>
> >
> >This is in response to your comment about how the configuration would
> >be reverted before the timer pops. We can't use the rollback operation
> >to explain it, because netconf doesn't have one at this point. I could
> >remove that text completely if it's confusing.
> >
> The fact that you have a rollback operation in Junoscript doesn't
> really apply to this document.  The sentence doesn't convey the idea
> that the confirmed commit can be canceled through proprietary
> mechanisms, outside the scope of the standard.

Sorry for the confusion, that's not what this is saying. The point is 
that one can use netconf as specified to restore the configuration 
using edit-config.

It has nothing to do with a proprietary mechanism.

> IMO, we need to remove this sentence since there is no rollback
> operation in netconf.  In fact, we should say instead that netconf
> provides no mechanism to force the agent to cancel the confirmed
> commit and revert the <running> configuration.  The manager has to
> wait for the timeout interval to pass.

That's not true. Either way the cancel/revert is a manager-initiated
action. It's a little clunky for the manager to restore the
configuration using edit-config, but it works. This seems to be causing
confusion, so I can remove it, but I don't think it's accurate to say
that netconf provides no mechanism to revert the running configuration.
It provides edit-config, which is awkward (because the manager has to
have the configuration in hand) but works.
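
As a rough illustration of that manager-side restore, the sketch below
builds the two requests a manager might send: an edit-config that
replaces <candidate> with a previously saved configuration, then a
commit, which also serves as the confirming commit. Several things here
are assumptions rather than anything from the draft under discussion:
the namespace is the one later published in RFC 4741, the helper names
and message-ids are made up, the saved <system> fragment and its
example.com namespace are placeholders, and transport/framing is left
out entirely.

```python
import xml.etree.ElementTree as ET

# Assumption: base namespace as published in RFC 4741; the draft being
# discussed in this thread may have used a different one.
NS = "urn:ietf:params:xml:ns:netconf:base:1.0"


def rpc(message_id: int, operation: ET.Element) -> ET.Element:
    """Wrap an operation element in an <rpc> envelope (hypothetical helper)."""
    root = ET.Element(f"{{{NS}}}rpc", {"message-id": str(message_id)})
    root.append(operation)
    return root


def edit_config_restore(saved_config: ET.Element) -> ET.Element:
    """Build an <edit-config> that replaces <candidate> with saved_config."""
    op = ET.Element(f"{{{NS}}}edit-config")
    target = ET.SubElement(op, f"{{{NS}}}target")
    ET.SubElement(target, f"{{{NS}}}candidate")
    # "replace" asks the agent to replace the target's contents with <config>.
    ET.SubElement(op, f"{{{NS}}}default-operation").text = "replace"
    config = ET.SubElement(op, f"{{{NS}}}config")
    config.append(saved_config)
    return op


def commit() -> ET.Element:
    """Build a plain <commit>, which also confirms a pending confirmed commit."""
    return ET.Element(f"{{{NS}}}commit")


# Placeholder for the configuration the manager saved before the
# confirmed commit; the element names and namespace are illustrative only.
saved = ET.fromstring(
    '<system xmlns="http://example.com/schema/1.0">'
    "<host-name>baseline</host-name>"
    "</system>"
)

restore_rpc = rpc(101, edit_config_restore(saved))
commit_rpc = rpc(102, commit())

restore_xml = ET.tostring(restore_rpc, encoding="unicode")
commit_xml = ET.tostring(commit_rpc, encoding="unicode")
print(restore_xml)
print(commit_xml)
```

The awkward part shows up in the `saved` variable: the manager must
already hold the pre-commit configuration in order to send it back.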

> BTW, the phrase "the confirming commit can explicitly restore the
> configuration" doesn't really make sense. s/confirming commit/manager/
> and it does.

Yes, that sounds good, thanks.

Rob
