
RE: More confirmed-commit issues



Hi,

My two cents - if a session/connection breaks before
a commit, then rollback to previous/current config
MUST happen as a hard requirement of NetConf.

The idea that Manager B gets (all unknowing) the
loose ends from Manager A's previous session will
_never_ get past IESG Security review.
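
One way to read that requirement, as a minimal sketch only: a toy
in-memory agent ties the pending confirmed commit to the session that
issued it, and losing that session before the confirming commit restores
the previous config at once. The `Agent` class and its method names are
hypothetical and are not taken from the NETCONF draft.

```python
class Agent:
    """Toy agent (hypothetical): ties a pending confirmed commit to its session."""

    def __init__(self, running):
        self.running = running
        self.pending = None  # (session_id, config to restore), or None

    def confirmed_commit(self, session_id, new_config):
        # Remember what to restore, then activate the new config.
        self.pending = (session_id, self.running)
        self.running = new_config

    def confirm(self, session_id):
        # A confirming commit from the same session clears the pending revert.
        if self.pending and self.pending[0] == session_id:
            self.pending = None

    def session_lost(self, session_id):
        # Assumed policy: a broken session rolls back immediately,
        # so no later manager inherits a pending revert.
        if self.pending and self.pending[0] == session_id:
            self.running = self.pending[1]
            self.pending = None


agent = Agent(running="baseline")
agent.confirmed_commit("A", "new-config")
agent.session_lost("A")   # connection drops before the confirming commit
print(agent.running)      # baseline
```

With this policy the next manager to connect sees the baseline config
and nothing is left pending.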

Cheers,
- Ira

Ira McDonald (Musician / Software Architect)
Blue Roof Music / High North Inc
PO Box 221  Grand Marais, MI  49839
phone: +1-906-494-2434
email: imcdonald@sharplabs.com

> -----Original Message-----
> From: owner-netconf@ops.ietf.org [mailto:owner-netconf@ops.ietf.org] On Behalf Of Rob Enns
> Sent: Wednesday, May 18, 2005 3:34 PM
> To: Andy Bierman
> Cc: netconf
> Subject: RE: More confirmed-commit issues
> 
> 
> > >>  (NEW ISSUE: What happens to a confirmed commit in progress
> > >>   if the session is lost or the agent reboots?)
> > 
> > To confirm the above issues:
> > 
> > If the session doing the confirmed commit is lost, the confirmed
> > commit continues.
> > 
> > If the agent reboots in the middle of a confirmed commit, I assume
> > the box boots with the new config, so an agent reboot acts like a
> > 2nd commit.  Yuch.  Or does the agent remember that a revert timeout
> > was pending?  If the timer doesn't survive, and the first commit
> > CAUSED the reboot, isn't this device in an endless reboot loop?  If
> > the crash happens in the startup sequence, before the timer can pop,
> > it's in an endless reboot loop anyway.
> 
> I think there are 2 cases here:
> 1) intentional reboot
> -> if an operator intentionally reboots the box in the
> middle of the confirmed commit, I'd say that effectively
> confirms the commit, and we should explicitly mention 
> this case in the protocol spec.
> 
> 2) unintentional reboot (aka bug)
> -> we can't standardize what netconf does in this case, right?
> 
> 
> 
> > >> T0    - boot with baseline config
> > >> Tc    - Manager A issues a confirmed commit, w/ revert to
> > >>         baseline at Tc+i
> > >> Tc+1  - Manager A loses its connection and session
> > >> Tc+10 - Manager B has no idea Manager A did this, comes along,
> > >>         gets the lock, and starts writing to the <candidate>
> > >>         config, which starts with the contents of <running>
> > >>         at time Tc
> > >> Tc+20 - Then Manager A comes back and can't get a lock
> > >> Tc+i  - Manager A's revert timer pops before Manager B is done.
> > >>         The agent reverts the state of <running> to T0.  (But B
> > >>         thinks the state of <running> is Tc).
> > >>
> > >>At this point, it depends on the difference between config T0
> > >>and Tc, and what Manager B is doing, as to whether benign or
> > >>devastating effects will follow.
> > >>
> > >>It's never a good thing to design this much "astonishment" into
> > >>routing products.  At a minimum, we need to document what happens
> > >>in as many corner cases as we can think of, but we should also try
> > >>to respect the principle of least astonishment.
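
The timeline above can be replayed with a toy model. This is a sketch
under stated assumptions, not agent code: the `Agent` class, its method
names, and the choice of i = 30 time units are all hypothetical, and
configs are reduced to the labels "T0" and "Tc". Per the behavior
described in this thread, losing the session releases the lock but
leaves the revert timer running.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Agent:
    """Toy model of an agent with a confirmed-commit revert timer."""
    running: str
    lock_owner: Optional[str] = None
    revert_to: Optional[str] = None   # config restored when the timer pops
    revert_at: Optional[int] = None   # absolute time of the pending revert

    def lock(self, who: str) -> bool:
        if self.lock_owner is not None and self.lock_owner != who:
            return False
        self.lock_owner = who
        return True

    def session_lost(self, who: str) -> None:
        # The lock is released; the pending revert is NOT cancelled.
        if self.lock_owner == who:
            self.lock_owner = None

    def confirmed_commit(self, new_config: str, now: int, timeout: int) -> None:
        self.revert_to = self.running
        self.revert_at = now + timeout
        self.running = new_config

    def tick(self, now: int) -> None:
        # Pop the revert timer if its deadline has passed.
        if self.revert_at is not None and now >= self.revert_at:
            self.running = self.revert_to
            self.revert_to = None
            self.revert_at = None


agent = Agent(running="T0")
agent.lock("A")
agent.confirmed_commit("Tc", now=0, timeout=30)  # Tc, revert due at Tc+30
agent.session_lost("A")                          # Tc+1
agent.tick(10)
b_got_lock = agent.lock("B")                     # Tc+10: B gets the lock
b_view = agent.running                           # B's <candidate> starts here
agent.tick(20)
a_got_lock = agent.lock("A")                     # Tc+20: A is refused
agent.tick(30)                                   # Tc+i: timer pops
print(b_view, agent.running)                     # Tc T0
```

Manager B is still working from "Tc" while <running> is back at "T0",
which is the mismatch the timeline describes.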
> > >>    
> > >>
> > >
> > >I don't view this as astonishing, or a side effect. It's very
> > >simple: when the timer pops from a confirmed commit, the device
> > >will revert to the T0 configuration. I'd argue that it's the kind
> > >of easy to understand basic behavior that operators like.
> > >  
> > >
> > It is astonishing to Mgr B who has no way of knowing a revert
> > timeout is pending. To me, the whole thing is just fragile. 
> > IMO, a configuration protocol should be robust, not fragile.
> 
> I don't think it's fragile. The problem in this scenario is that
> Mgr A and Mgr B don't know what each other are doing. We can't
> standardize a way out of badly managed networks.
> 
> > A protocol that can allow the possibility of severely detrimental
> > config changes (through unintended or malicious acts), by merely
> > dropping a connection, is fragile.
> 
> Why would reverting to the T0 configuration, which is both what Mgr A
> wanted to do, and what the device was running before, be severely
> detrimental?
> 
> We can sit around making up corner cases where Mgr A and Mgr B don't
> know what each other are doing; that's easy to do and not very
> productive. There's no way the device ends up with a sane
> configuration at the end of the day using _any_ configuration method,
> if the entities doing the configuration aren't coordinated.
> 
> 
> The question is, what's the risk/reward of standardizing a feature
> like confirmed commit. The risk is that operators that aren't aware
> that a confirmed commit is underway could lose changes. The reward is
> that we have a standardized way to protect against devices falling
> off the network due to a change. IMO there is a very clear benefit
> which outweighs the risk. And the risk is explicitly identified in the
> protocol document.
> 
> > It's possible the security AD could have a problem with this too,
> > during the IESG review.
> >
> > >>>  If a confirming commit is not issued, the device will revert its
> > >>>  configuration to the state prior to the issuance of the confirmed
> > >>>  commit.  Note that any commit operation, including a commit which
> > >>>  introduces additional changes to the configuration, will serve as a
> > >>>  confirming commit.  Thus to cancel a confirmed commit and revert
> > >>>  changes without waiting for the confirm timeout to expire, the
> > >>>  confirming commit can explicitly restore the configuration to its
> > >>>  state before the confirmed commit was issued.
> > >>> 
> > >>>
> > >>>      
> > >>>
> > >>I don't understand this last sentence, and this revert 
> > >>operation at all.
> > >>    
> > >>
> > >
> > >This is in response to your comment about how the configuration
> > >would be reverted before the timer pops. We can't use the rollback
> > >operation to explain it, because netconf doesn't have one at this
> > >point. I could remove that text completely if it's confusing.
> > >  
> > >
> > The fact that you have a rollback operation in Junoscript doesn't
> > really apply to this document.  The sentence doesn't convey the idea
> > that the confirmed commit can be canceled through proprietary
> > mechanisms, outside the scope of the standard.
> 
> Sorry for the confusion, that's not what this is saying. The point is 
> that one can use netconf as specified to restore the configuration 
> using edit-config.
> 
> It has nothing to do with a proprietary mechanism.
> 
> > IMO, we need to remove this sentence since there is no rollback
> > operation in netconf.  In fact, we should say instead that netconf
> > provides no mechanism to force the agent to cancel the confirmed
> > commit and revert the <running> configuration.  The manager has to
> > wait for the timeout interval to pass.
> 
> That's not true. Either way the cancel/revert is a manager initiated
> action. It's a little clunky for the manager to restore the
> configuration using edit-config, but it works. This seems to be
> causing confusion so I can remove it, but I don't think it's accurate
> to say that netconf provides no mechanism to revert the running
> configuration. It provides edit-config, which is awkward (because the
> manager has to have the configuration in hand) but works.
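
The manager-driven revert described above could look roughly like this.
It is a sketch against a toy in-memory agent, not a real NETCONF client:
`ToyAgent` and its methods are hypothetical stand-ins named after the
get-config, edit-config and commit operations, configs are plain dicts,
and edit-config is assumed to replace the whole <candidate>.

```python
class ToyAgent:
    """Hypothetical stand-in for an agent with a <candidate> datastore."""

    def __init__(self, running):
        self.running = dict(running)
        self.candidate = dict(running)
        self.pending_revert = None  # config restored if the timer pops

    def get_config(self):
        return dict(self.running)

    def edit_config(self, config):
        # Assumed "replace" semantics: overwrite the whole candidate.
        self.candidate = dict(config)

    def commit(self, confirmed=False):
        if confirmed:
            self.pending_revert = dict(self.running)
        else:
            # Any plain commit serves as the confirming commit.
            self.pending_revert = None
        self.running = dict(self.candidate)


agent = ToyAgent({"mtu": 1500})

# The manager must have the old configuration in hand before it starts.
saved = agent.get_config()

agent.edit_config({"mtu": 9000})
agent.commit(confirmed=True)     # revert timer now pending

# Cancel early: write the saved config back and commit it, which both
# restores <running> and acts as the confirming commit.
agent.edit_config(saved)
agent.commit()
print(agent.running == saved, agent.pending_revert)  # True None
```

The awkward part is visible in the sketch: nothing works unless the
manager fetched `saved` before issuing the confirmed commit.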
> 
> > BTW, the phrase "the confirming commit can explicitly restore the
> > configuration" doesn't really make sense. s/confirming commit/manager/
> > and it does.
> 
> Yes, that sounds good, thanks.
> 
> Rob
> 
> --
> to unsubscribe send a message to netconf-request@ops.ietf.org with
> the word 'unsubscribe' in a single line as the message text body.
> archive: <http://ops.ietf.org/lists/netconf/>
> 
