RE: More confirmed-commit issues
Hi,
My two cents - if a session/connection breaks before
a commit, then rollback to previous/current config
MUST happen as a hard requirement of NetConf.
The idea that Manager B gets (all unknowing) the
loose ends from Manager A's previous session will
_never_ get past IESG Security review.
Cheers,
- Ira
Ira McDonald (Musician / Software Architect)
Blue Roof Music / High North Inc
PO Box 221 Grand Marais, MI 49839
phone: +1-906-494-2434
email: imcdonald@sharplabs.com
> -----Original Message-----
> From: owner-netconf@ops.ietf.org [mailto:owner-netconf@ops.ietf.org] On Behalf Of Rob Enns
> Sent: Wednesday, May 18, 2005 3:34 PM
> To: Andy Bierman
> Cc: netconf
> Subject: RE: More confirmed-commit issues
>
>
> > >> (NEW ISSUE: What happens to a confirmed commit in progress if the
> > >> session is lost or the agent reboots?)
> >
> > To confirm the above issues:
> >
> > If the session doing the confirmed commit is lost, the confirmed
> > commit continues.
> >
> > If the agent reboots in the middle of a confirmed commit, I assume
> > the box boots with the new config, so an agent reboot acts like a
> > 2nd commit. Yuch. Or does the agent remember that a revert timeout
> > was pending? If the timer doesn't survive, and the first commit
> > CAUSED the reboot, isn't this device in an endless reboot loop? If
> > the crash happens in the startup sequence, before the timer can pop,
> > it's in an endless reboot loop anyway.
>
> I think there are 2 cases here:
> 1) intentional reboot
> -> if an operator intentionally reboots the box in the
> middle of the confirmed commit, I'd say that effectively
> confirms the commit, and we should explicitly mention
> this case in the protocol spec.
>
> 2) unintentional reboot (aka bug)
> -> we can't standardize what netconf does in this case, right?
>
>
>
> > >> T0    - boot with baseline config
> > >> Tc    - Manager A issues a confirmed commit, w/ revert to baseline
> > >>         at Tc+i
> > >> Tc+1  - Manager A loses its connection and session
> > >> Tc+10 - Manager B has no idea Manager A did this, comes along, gets
> > >>         the lock, and starts writing to the <candidate> config, which
> > >>         starts with the contents of <running> at time Tc
> > >> Tc+20 - Then Manager A comes back and can't get a lock
> > >> Tc+i  - Manager A's revert timer pops before Manager B is done.
> > >>         The agent reverts the state of <running> to T0. (But B
> > >>         thinks the state of <running> is Tc).
> > >>
> > >>At this point, it depends on the difference between config T0 and Tc,
> > >>and what Manager B is doing, as to whether benign or devastating
> > >>effects will follow.
> > >>
> > >>It's never a good thing to design this much "astonishment" into
> > >>routing products.
> > >>At a minimum, we need to document what happens in as many corner cases
> > >>as we can think of, but we should also try to respect the principle of
> > >>least astonishment.
> > >>
> > >>
> > >
> > >I don't view this as astonishing, or a side effect. It's very
> > >simple: when the timer pops from a confirmed commit, the device
> > >will revert to the T0 configuration. I'd argue that it's the kind
> > >of easy to understand basic behavior that operators like.
> > >
> > >
> > It is astonishing to Mgr B who has no way of knowing a revert
> > timeout is pending. To me, the whole thing is just fragile.
> > IMO, a configuration protocol should be robust, not fragile.
>
> I don't think it's fragile. The problem in this scenario is that
> Mgr A and Mgr B don't know what each other are doing. We can't
> standardize a way out of badly managed networks.
>
> > A protocol that can allow the possibility of severely detrimental
> > config changes (through unintended or malicious acts), by merely
> > dropping a connection, is fragile.
>
> Why would reverting to the T0 configuration, which is both what Mgr A
> wanted to do, and what the device was running before, be severely
> detrimental?
>
> We can sit around making up corner cases where Mgr A and Mgr B don't
> know what each other are doing; that's easy to do and not very
> productive. There's no way the device ends up with a sane configuration
> at the end of the day using _any_ configuration method, if the entities
> doing the configuration aren't coordinated.
>
> The question is, what's the risk/reward of standardizing a feature
> like confirmed commit. The risk is that operators that aren't aware
> that a confirmed commit is underway could lose changes. The reward is
> that we have a standardized way to protect against devices falling
> off the network due to a change. IMO there is a very clear benefit which
> outweighs the risk. And the risk is explicitly identified in the
> protocol document.
>
> > It's possible the security AD could have a problem with this too,
> > during the IESG review.
> >
> > >>>   If a confirming commit is not issued, the device will revert its
> > >>>   configuration to the state prior to the issuance of the confirmed
> > >>>   commit.  Note that any commit operation, including a commit which
> > >>>   introduces additional changes to the configuration, will serve as a
> > >>>   confirming commit.  Thus to cancel a confirmed commit and revert
> > >>>   changes without waiting for the confirm timeout to expire, the
> > >>>   confirming commit can explicitly restore the configuration to its
> > >>>   state before the confirmed commit was issued.
> > >>>
> > >>>
> > >>>
> > >>>
> > >>I don't understand this last sentence, and this revert operation at all.
> > >>
> > >>
> > >
> > >This is in response to your comment about how the configuration would
> > >be reverted before the timer pops. We can't use the rollback operation
> > >to explain it, because netconf doesn't have one at this point. I could
> > >remove that text completely if it's confusing.
> > >
> > >
> > The fact that you have a rollback operation in Junoscript doesn't
> > really apply to this document. The sentence doesn't convey the idea
> > that the confirmed commit can be canceled through proprietary
> > mechanisms, outside the scope of the standard.
>
> Sorry for the confusion, that's not what this is saying. The point is
> that one can use netconf as specified to restore the configuration
> using edit-config.
>
> It has nothing to do with a proprietary mechanism.
>
> > IMO, we need to remove this sentence since there is no rollback
> > operation in netconf.
> > In fact, we should say instead that netconf provides no mechanism to
> > force the agent to cancel the confirmed commit and revert the <running>
> > configuration.
> > The manager has to wait for the timeout interval to pass.
>
> That's not true. Either way the cancel/revert is a manager-initiated
> action. It's a little clunky for the manager to restore the
> configuration using edit-config, but it works. This seems to be causing
> confusion so I can remove it, but I don't think it's accurate to say
> that netconf provides no mechanism to revert the running configuration.
> It provides edit-config, which is awkward (because the manager has to
> have the configuration in hand) but works.
>
> > BTW, the phrase "the confirming commit can explicitly restore the
> > configuration" doesn't really make sense. s/confirming commit/manager/
> > and it does.
>
> Yes, that sounds good, thanks.
>
> Rob
>
> --
> to unsubscribe send a message to netconf-request@ops.ietf.org with
> the word 'unsubscribe' in a single line as the message text body.
> archive: <http://ops.ietf.org/lists/netconf/>
>
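
For readers following the Tc timeline quoted above, here is a minimal
Python sketch that replays it under the two session-loss policies argued
in this thread: either the revert timer keeps running after Manager A's
session drops, or the agent reverts as soon as that session is lost (the
behavior asked for at the top of this message). Everything in it is an
assumption made for illustration: ToyAgent, its method names, the
rollback_on_session_loss flag, the mtu values and the 30-tick timeout are
hypothetical and are not taken from the NETCONF draft or from any
implementation. Locking and <candidate> handling are left out.

```python
class ToyAgent:
    """Toy model of an agent with a confirmed-commit revert timer (hypothetical)."""

    def __init__(self, running, rollback_on_session_loss=False):
        self.running = dict(running)
        self.rollback_on_session_loss = rollback_on_session_loss
        self.now = 0
        self._saved = None      # config to restore if no confirming commit arrives
        self._deadline = None   # tick at which the revert timer pops
        self._owner = None      # session that issued the confirmed commit

    def confirmed_commit(self, session, candidate, timeout):
        # Apply the new config, remember the old one, and start the timer.
        self._saved = dict(self.running)
        self.running = dict(candidate)
        self._deadline = self.now + timeout
        self._owner = session

    def commit(self, candidate):
        # Any commit serves as the confirming commit and cancels the timer.
        self.running = dict(candidate)
        self._clear()

    def session_lost(self, session):
        # Assumed policy switch: revert at once only if the flag is set.
        if self.rollback_on_session_loss and session == self._owner:
            self._revert()

    def advance(self, to):
        # Move the clock forward; revert if the timer has popped.
        self.now = to
        if self._deadline is not None and self.now >= self._deadline:
            self._revert()

    def _revert(self):
        self.running = self._saved
        self._clear()

    def _clear(self):
        self._saved = self._deadline = self._owner = None


def replay(rollback_on_session_loss):
    """Replay the timeline with Tc = 0 and i = 30; return (B's view, final <running>)."""
    t0 = {"mtu": 1500}   # baseline config (made-up content)
    tc = {"mtu": 9000}   # Manager A's change (made-up content)
    agent = ToyAgent(t0, rollback_on_session_loss)
    agent.confirmed_commit("A", tc, timeout=30)   # Tc
    agent.advance(1)
    agent.session_lost("A")                       # Tc+1
    agent.advance(10)
    b_view = dict(agent.running)                  # Tc+10: what B starts from
    agent.advance(30)                             # Tc+i
    return b_view, agent.running


if __name__ == "__main__":
    for policy in (False, True):
        b_view, final = replay(policy)
        print(policy, b_view, final, b_view == final)
```

With the timer left running, Manager B's starting point ({"mtu": 9000})
no longer matches <running> once the timer pops. With rollback on session
loss, B starts from the T0 config and the two agree.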