
Re: soft state (was Re: shim6 and bit errors in data packet headers)



marcelo bagnulo braun wrote:

Why would we want to couple the state management aspects of shim6 with the shim6 test protocol? To me any such coupling seems undesirable, especially since the parameters for the test protocol (how quickly to detect failures) might be a function of upper layer advice, as well as upper layer hints of "working" or "not working".


Well, I guess that the situation where one of the nodes has lost the shim state can be seen as a form of failure, and my assumption is that the failure detection mechanisms will likely detect it first.

But that's a circular argument for including the context state in the failure detection mechanism. You are in effect saying that the test protocol should test whether the context has been lost on the peer since it can be made to test for a lost context on the peer.


FWIW the outline of a test protocol in section 5.4 of draft-arkko-multi6dt-failure-detection-00.txt doesn't assume such a thing. (But it does assume that B remembers something about previously received probes, so there are some issues about DoS opportunities.)

I think that the protocol behaviour would be something like this.

A communication is established between node A and node B
Later on, a shim context is created between those two nodes.
The parameters for that context are:
  ULIDs: IPA1 and IPB1
  Locators: for IPA1 (IPA1,...,IPAn)
            for IPB1 (IPB1,...,IPBm)

And a context tag presumably.
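For concreteness, the context parameters listed above could be held in a record like the following. This is only a sketch from node A's point of view: the class name, field names and the tag value are made up for illustration, and nothing here says how the context tag is actually carried.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ShimContext:
    """Hypothetical per-context state, as seen from node A."""
    local_ulid: str
    peer_ulid: str
    local_locators: Tuple[str, ...]
    peer_locators: Tuple[str, ...]
    context_tag: int  # presumed to exist; the value below is arbitrary

    def uses_ulids_as_locators(self, src: str, dst: str) -> bool:
        """True when the locator pair in use equals the ULID pair."""
        return (src, dst) == (self.local_ulid, self.peer_ulid)


# Example context: ULIDs IPA1/IPB1, two locators on each side (assumed).
ctx = ShimContext(
    local_ulid="IPA1",
    peer_ulid="IPB1",
    local_locators=("IPA1", "IPA2"),
    peer_locators=("IPB1", "IPB2"),
    context_tag=0x1234,
)
```

The helper separates the case where packets still carry the ULIDs from the case where alternative locators are in use, which is the split the scenarios below rely on.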

Suppose that for some reason node B loses the shim context (and only the shim context, i.e. the application and transport state about ongoing communications is preserved).

I guess that at this point we have several scenarios to consider:

Scenario a): the communication between A and B is still using IPA1 and IPB1 as locators.
This scenario has two subcases:
Scenario a.1) The communication is bidirectional and e.g.
TCP is providing acks of the progress of the communication.
This means that no periodic reachability test
nor any other shim signaling is being exchanged.
In this scenario, a loss of shim context would remain
undetected until there is a failure and node A detects it
and tries to explore alternative paths. This is so because
data packets will carry ULIDs and will be passed successfully
to the upper layers.

If we assume that B (as well as A) will have a heuristic to create shim6 contexts (e.g. based on having received 50 packets for a locator pair), then this heuristic might be triggered and cause B to try to establish a context with A, at which point A will see that it already has a context with B.
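A rough sketch of such a heuristic, assuming a simple per-locator-pair packet counter. The class and method names are invented for illustration; the threshold of 50 is just the example figure mentioned above, not a proposed value.

```python
from collections import Counter


class ContextHeuristic:
    """Hypothetical trigger: start context establishment once enough
    packets have been seen for a given locator pair."""

    def __init__(self, threshold: int = 50):
        self.threshold = threshold
        self.counts = Counter()

    def on_packet(self, src: str, dst: str) -> bool:
        """Count one packet; return True exactly once, when the count
        for this locator pair reaches the threshold."""
        pair = (src, dst)
        self.counts[pair] += 1
        return self.counts[pair] == self.threshold
```

A node B that has lost its context but still sees traffic on IPA1/IPB1 would eventually hit the threshold again and try to set up a context, which is how A would learn that its existing context is stale.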


Once there is a failure, then
              reachability test packets won't be recognized as belonging
              to any existing shim context and the problem can be detected.

Here you are already assuming that reachability test packets will not be recognized, i.e. presupposing a particular interaction between the state management and the test protocol.


Scenario a.2) The communication is unidirectional.
              In this case, periodic reachability tests need to be
              performed in order to verify that the path is still working.
              If node B loses its shim state, it won't recognize
              the reachability test packets, and the loss of context can
              be detected.

Again, here you are presupposing a particular interaction.

Scenario b) the communication between A and B is using alternative locators.
In this case, when node B loses the context, data packets won't be properly delivered at node B, because they won't be properly demuxed.
At this point, the reachability test will be performed to verify the locator pair being used.

If you are using alternate locators and the working locator pair is unidirectional, then it seems like you'd need to be able to re-discover that working unidirectional locator pair, before you can re-establish the context state on B.
Thus if A is sending using IPA1->IPB2 and B was replying using IPB1->IPA2, and B loses the context state, what do you do?
Seems like solving this case requires that the test protocol is not tied in with the state management.


I don't know if I am missing something, but AFAICS, all the situations where the shim context is lost result in a reachability test exchange, and that is why I was wondering if it wouldn't make sense to define a "no-context" error message as a reply to a reachability test request packet.

That is one particular solution with strong coupling between the test protocol and the state management.


But don't we want to retain the possibility to test locator pairs for initial contact, i.e. before a context is established between the peers? And handle the above case of unidirectional locator pairs?


But I fail to understand how the node that has lost the state can identify that a data packet belongs to a nonexistent shim state...

By seeing that the <source locator, destination locator, context tag> doesn't match any existing context?
I suspect we want that capability for robustness in any case.
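As a sketch of that check, assuming the receiver keys its contexts on the <source locator, destination locator, context tag> triple. The table class and the string context handles are hypothetical, not anything defined for shim6.

```python
from typing import Dict, Optional, Tuple

# Key: (source locator, destination locator, context tag)
ContextKey = Tuple[str, str, int]


class ContextTable:
    """Hypothetical receiver-side table of established shim contexts."""

    def __init__(self) -> None:
        self._contexts: Dict[ContextKey, str] = {}

    def add(self, src: str, dst: str, tag: int, handle: str) -> None:
        self._contexts[(src, dst, tag)] = handle

    def lookup(self, src: str, dst: str, tag: int) -> Optional[str]:
        """Return the context handle, or None if no context matches."""
        return self._contexts.get((src, dst, tag))

    def lose_all_state(self) -> None:
        """Model a node that has lost its shim state."""
        self._contexts.clear()
```

A lookup that returns None for a packet carrying a context tag is the "doesn't match any existing context" condition; what the receiver does next (drop silently, send an error) is a separate question.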


I mean, I guess that a first element that is relevant here is where we are going to carry the context tag.
If the context tag is carried in an extension header or destination option, then I can see that a node that receives a packet with one of those can easily detect that there is no context associated. (Note that in this case, the context loss is only detected when the locators used for the communication differ from the ULIDs, i.e. when the extension header or destination option is included in the packet.)


If the context tag is included in the flow label, then I don't see how a node that receives the data packet can determine that the packet is associated with a shim context that is no longer there. At this point, I guess that, as you mentioned in a previous mail, the data packet would be silently discarded, right?

If the context tag is carried as a flow label, I still think we need a way to tell the receiver "this is a shim6 packet". For robustness reasons I think the fact that the packet needs shim6 processing should be explicit.
There have been proposals in multi6 which suggested doing this without making the packets larger by defining a set of new nexthdr values with meaning like
shim6+tcp
shim6+udp
...
shim6+esp


Not having that "shim6" bit when the flow label is used as a context tag can easily result in hard to diagnose errors. We might have errors due to some middlebox messing with the data packets (a TCP relay for instance), but that leaves the shim6 test packets alone. If the TCP relay doesn't preserve the flow label, then the packets would be dropped due to TCP checksum errors (since the ULID rewrite didn't happen), but the test protocol would say that everything is fine.

I think that at this point it is clear to me that if we define a no-context error message, this message should be defined as a reply to a packet that refers to that context, and it should include enough information about this initial packet to verify that it is a reply to that packet.

The no-context error message cannot be issued spontaneously by a node.

Agreed.
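
One way to picture that constraint, as a sketch only: the sender remembers what it has outstanding and accepts a no-context error only if the error echoes something from one of those packets. The nonce field, the message layout and all names here are assumptions made for illustration; nothing in this thread defines them.

```python
from dataclasses import dataclass
from typing import Set, Tuple


@dataclass(frozen=True)
class NoContextError:
    """Hypothetical error message echoing fields of the triggering packet."""
    context_tag: int
    echoed_nonce: int  # assumed per-packet value chosen by the sender


class Sender:
    """Tracks outstanding packets so unsolicited errors can be rejected."""

    def __init__(self) -> None:
        self.outstanding: Set[Tuple[int, int]] = set()

    def send(self, context_tag: int, nonce: int) -> None:
        self.outstanding.add((context_tag, nonce))

    def accept_error(self, err: NoContextError) -> bool:
        """Accept only an error that matches an outstanding packet,
        and only once per packet."""
        key = (err.context_tag, err.echoed_nonce)
        if key in self.outstanding:
            self.outstanding.discard(key)
            return True
        return False
```

An error that doesn't echo an outstanding packet is ignored, so a node can't issue the message spontaneously and have it acted upon.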

  Erik