
Re: failure detection



Hi John,

some comments below...

On 15/08/2005, at 13:36, <john.loughney@nokia.com> wrote:

Iljitsch,


Somewhat to my disappointment, there were no opinions about
which of the two failure detection mechanisms is better,
either during the sessions or afterward on the list.

At least for me, I'm still puzzling over how to capture failure ... Between 2 hosts, TCP traffic may work but not UDP traffic (mostly due to stupid middleboxes). Is this a failure from the shim point of view?


I guess we agree that the shim won't be able to track each different ULP communication, so the shim won't be able to know whether each communication is progressing adequately on its own, right?


so, the shim can only see the exchange of packets with the other end, and determine if packets are flowing with a given frequency. If packets stop flowing, the shim can guess a potential failure.
Now if packets are flowing (even if packets are only flowing to a given port and not to others) the shim by itself won't be able to detect this problem AFAICT.


I guess that the only one that can identify this problem is the ULP itself.

So I guess we have the following situation:
- The shim will only be able to detect (by itself, without additional information) failures when packets stop flowing.
So, when there is no additional information, the case that you are considering would not be detected by the shim.


- This case can be detected when there is additional information, such as ULP feedback and perhaps some ICMP messages.
In this case, we have the following scenarios:
a) different ULPs provide contradictory feedback
b) only the UDP app provides negative feedback but the shim can see that packets are still flowing (and maybe reachability tests are successful)
c) the shim receives ICMP errors but packets are still flowing.
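
To make the three scenarios concrete, here is a rough sketch in Python of how the shim could tell them apart from the signals it has. This is only an illustration: the names (Observation, classify, the ulp_feedback map of ULP name to "is it working?") are made up for this mail and are not taken from any shim specification.

```python
from enum import Enum
from typing import Dict


class Observation(Enum):
    """Hypothetical labels for what the shim can infer; not from any spec."""
    NOTHING_UNUSUAL = "nothing unusual"
    PACKETS_STOPPED = "packets stopped flowing"
    CONTRADICTORY_FEEDBACK = "a) ULPs disagree"
    NEGATIVE_FEEDBACK_ONLY = "b) negative ULP feedback, packets flowing"
    ICMP_ERRORS_ONLY = "c) ICMP errors, packets flowing"


def classify(packets_flowing: bool,
             ulp_feedback: Dict[str, bool],
             icmp_errors: bool) -> Observation:
    """Map the available signals to one of the scenarios.

    ulp_feedback maps a ULP name to True (positive) or False (negative);
    ULPs that gave no feedback are simply absent from the map.
    """
    if not packets_flowing:
        # The only case the shim can detect without extra information.
        return Observation.PACKETS_STOPPED
    verdicts = set(ulp_feedback.values())
    if verdicts == {True, False}:
        return Observation.CONTRADICTORY_FEEDBACK
    if verdicts == {False}:
        return Observation.NEGATIVE_FEEDBACK_ONLY
    if icmp_errors:
        return Observation.ICMP_ERRORS_ONLY
    return Observation.NOTHING_UNUSUAL
```

With this, John's example (TCP works, UDP does not, packets still flowing) would be classify(True, {"tcp": True, "udp": False}, False), which lands in scenario a). What the shim should then do in each scenario is the open question.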


I guess that in those scenarios the shim can detect that something strange, along the lines that you have described, is going on, and maybe some smart decision could be taken to deal with this... but before that, do you agree with this description of the scenarios?

Also, different transports have different concepts of what 'failure'
is. When is a path considered failed? Classical TCP failure
(retransmission of the same packet x times)?  Too high bit-error rate?
I think we might run into circumstances where, if we let the shim layer
decide conclusively that a path has failed, it might decide that the
path is bad much later than the transport layer knows.


Well, IMHO the shim only needs to define a default failure mode. I mean, it is clear that different apps have different perceptions of what a failure may be, but, since the shim is a generic layer, we need to provide a generic definition of what a failure is. Clearly, I guess it would be very useful to allow ULPs to provide information to the shim about when they consider that a failure has occurred.


So, i would say that we will have:
- A generic definition of failure that is used when ULPs don't provide additional information
- Feedback from each ULP about when it thinks that a failure has occurred.


For the generic failure definition, I think that the first hint that a failure may have occurred is that packets are flowing outwards but no incoming packets are received (for a given period T).
At this point I guess that an explicit reachability test needs to be performed.
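
A minimal sketch of this generic rule, in Python, just to show the idea. Everything here is an assumption for illustration: the class name ContextActivity, the method names, and the choice to measure T from the first packet sent since the last packet received.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ContextActivity:
    """Tracks traffic with one peer (hypothetical; not from any shim spec)."""
    timeout: float  # the period T, in seconds
    first_unanswered_send: Optional[float] = None

    def on_send(self, now: float) -> None:
        # Remember only the first send since we last heard from the peer.
        if self.first_unanswered_send is None:
            self.first_unanswered_send = now

    def on_receive(self, now: float) -> None:
        # Any incoming packet shows the peer is still reaching us.
        self.first_unanswered_send = None

    def needs_reachability_test(self, now: float) -> bool:
        """True if we have been sending for at least T with nothing coming back."""
        if self.first_unanswered_send is None:
            return False  # idle, or the peer answered: no hint of failure
        return now - self.first_unanswered_send >= self.timeout
```

Note that an idle context never triggers the test: if nothing is sent, nothing is expected back, so no failure is suspected.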



I think we should avoid any transport layer functions in the shim layer
(or at least as much as possible).  I'm leaning more towards the
rich-api type of functionality, where the shim provides as much info as
it can to the transport layers, and let the transport layers make
decisions.


Agreed, but do you agree that we also need to support those ULPs that don't have this rich API? I mean, that we need to support existing ULPs without requiring modifications for them to benefit from the shim?


I'm also wondering if the failure detection mechanism should be a
pluggable option; I've been thinking how the shim layer would work in
wireless environments, and I don't think I'd like to have fast
heartbeating, as that would drain batteries unnecessarily.  Could it be
possible to have this functionality as optional?

Well, I guess that many folks have considered that when positive feedback from the ULP is provided, heartbeats would be omitted. Is this what you had in mind?
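
Something like the following sketch, in Python, is what I mean by omitting heartbeats. The function name, its parameters and the single shared interval are all assumptions made up for this example, not a proposed design.

```python
from typing import Optional


def should_send_heartbeat(now: float,
                          last_heartbeat: Optional[float],
                          last_positive_feedback: Optional[float],
                          interval: float) -> bool:
    """Decide whether a heartbeat is due (hypothetical policy).

    Recent positive ULP feedback already shows the path works, so the
    heartbeat is skipped and no extra packet is sent.
    """
    if (last_positive_feedback is not None
            and now - last_positive_feedback < interval):
        return False
    # No recent positive feedback: fall back to periodic heartbeats.
    return last_heartbeat is None or now - last_heartbeat >= interval
```

A ULP that keeps reporting progress would then suppress heartbeats entirely, which is the battery-friendly case; only silent or unmodified ULPs would cause heartbeat traffic.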


regards, marcelo



John