
Re: shim6 @ NANOG (forwarded note from John Payne) (fwd)




On 08/03/2006, at 10:28, Igor Gashinsky wrote:

Hi Marcelo,

	My comments are in-line... sorry for the late reply, but I've been
traveling too much lately...

:: On 01/03/2006, at 10:10, Igor Gashinsky wrote:
:: So the effort for this case, imho, is put into enabling the capability
:: of establishing new sessions after an outage rather than into
:: preserving established connections. Does this make sense to you?

This makes a lot of sense, provided this happens under the hood of the
application (ie the web browser in this case). So, right now, for
example, if a client is pulling down a web page, gets the html, and in
the middle of downloading the .gif/.jpg his session dies (ie TCP RST),
the jpg that the client was in the middle of transferring will get that
ugly red "X". (Most browsers, right now, will not retry to get the
object again, and will just show it as unavailable.) This issue is
deemed important enough that most large content providers are spending
an inordinate amount of money on load balancers with active session sync
to try to prevent that from happening in the event of a load-balancer
fail-over. So, if application behavior could be changed to say "if shim6
fail-over is possible, and the connection just died (for any definition
of died), then attempt to re-establish the connection through the shim,
and then re-get the failed object", that would go a long way in making
this kind of fail-over better.


This is possible with the shim6 protocol, since it supports unreachable
ULIDs when establishing the shim context, so I guess this would be ok.
There are probably a couple of elements that are needed, like an
extended API to allow the apps to tell this to the shim (you probably
also want to inform the shim which locator is not working), and the shim
needs to remember the alternative locators obtained from the DNS even if
there is no shim context yet, in order to have a clue about which
alternative address to use (the other option is to perform a reverse
lookup to retrieve those... see the thread with Erik for more about this
point). But in any case, I think all these issues are easily solvable.
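The application-side fallback described here (retry the failed object
over an alternative locator, and tell the shim which locator did not
work) could be sketched roughly as below. This is only illustrative:
`fetch` is a caller-supplied transfer function and `on_failure` stands
in for the extended shim API being discussed, which does not exist yet.

```python
import socket

def fetch_with_fallback(addresses, fetch, on_failure=None):
    """Try each destination locator in turn until one succeeds.

    addresses  -- locators learned from the DNS (illustrative)
    fetch      -- callable that transfers the object via one address
    on_failure -- hypothetical hook to tell the shim a locator failed
    """
    last_error = None
    for addr in addresses:
        try:
            return fetch(addr)
        except (ConnectionResetError, socket.timeout, OSError) as err:
            last_error = err
            if on_failure is not None:
                on_failure(addr)  # report the broken locator to the shim
    raise ConnectionError("all locators failed") from last_error
```

The point of the sketch is only the control flow: the retry is driven by
the application, but the locator bookkeeping could equally live behind
the shim API.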

But I have an additional question about this point. If the application
is the one that will determine that there is a problem and will ask the
shim to establish a context (which is ok, no problem here), wouldn't the
application be better off simply retrying with alternative locators by
itself, rather than asking the shim to do it?

The difference with shim6, as opposed to v4, is that in the v4 world the
connection wouldn't die; it would just hang for the duration of
convergence (provided convergence is fast enough, which normally it is),
and then continue on its merry way with new tcp windows. In shim6, if
the client[ip1]-server connection goes down, re-establishing to
client[ip2]-server would not be "hitless" (ie the session would die),
and to solve that problem we are back at either keeping an inordinate
amount of state on the webservers (which is not very realistic), a shift
in the way people write applications (which, in my opinion, is
preferred, but a *very* hard problem to solve), or somehow figuring out
how to hide this in the stack with a minimal performance hit (let's say
a sub-1% memory hit) when you have 30k+ simultaneous connections per
server...

Well, if you use the shim approach that you suggest above, the server
does not have to store any shim state while things are going fine, and
if a client detects a problem it can trigger the creation of the shim
context from the client to the server. At this point, the server will
need some shim state, but only for those connections that have failed
(of course, if one of the links to the server went down, then all the
clients connecting through that link will attempt to create shim state).

I guess that this could be a reasonable trade-off between state in the
server and response time when outages occur.


:: > 3) While TE has been discussed at length already, it is something
:: > which is absolutely required for a content provider to deploy shim6.
:: > There has been quite a bit of talk about what TE is used for, but it
:: > seems that few people recognize it as a way of expressing
:: > "business/financial policies". For example, in the v4 world, the
:: > (multi-homed) end-user may be visible via both a *paid* transit path
:: > (say UUnet) and a *free* peering link (say Cogent), and I would
:: > wager that most content providers would choose the free link (even
:: > if performance on that link is (not hugely) worse). That capability
:: > all but disappears in the v6 world if the client ID was sourced from
:: > their UUnet ip address (since that's who they chose to use for
:: > outbound traffic), and the (web) server does not know that that
:: > locator also corresponds to a Cogent IP (which they can reach for
:: > free).
::
:: I fail to understand the example that you are presenting here...
::
:: are you considering the case where both the client and the server are
:: multihomed to Cogent and UUnet?
:: something like
::
:: UUnet
:: /     \
:: C       S
:: \     /
:: Cogent

Yes, but now imagine that the "C" in this case is a client using shim6
with multiple IPs, and the server is in IPv6 PI space. Also, if it
weren't in PI space, the connection to the server *can* be influenced
via SRV (although that's trying to shoehorn DNS into where perhaps it
shouldn't go -- since now the DNS server needs to be aware of link-state
in the network to determine if the UUnet/Cogent connections are even up,
and for a sufficiently large "S", that could be tens, or even hundreds,
of links, which presents a very interesting scaling problem for DNS...
even more interesting is that most large content providers are actually
in the thousands, and that's why they can get PI space -- they are
effectively (at least) a tier-2 ISP). But, back to the example at
hand... so, for the sake of this example, let's say that the UUnet port
is $20/Mbps, and the Cogent port is an SFI (free) peer. So, the client
(with IPs of IP-uunet and IP-cogent) picks IP-uunet (because they want
to use their UUnet connection outbound) to initiate a connection to the
server. The problem now comes from the fact that the server, when
replying to the client, is unaware that IP-cogent is associated with the
client (since the shim layer has not kicked in on initial connect) and
will have to send traffic through the very expensive UUnet port.

That I don't follow.

Suppose that the server has v6 PI addresses, which for very big sites
makes sense imho.

The server can send traffic with a destination address belonging to
UUnet through Cogent, right? I mean, I am assuming that UUnet and Cogent
have connectivity that is not through S.

I mean, the client can choose to use the IP from UUnet (that is his
choice and he has the right to do so, because he is paying for it). This
choice affects the ISP used to get _to_ the client, and it shouldn't
determine the ISP used to get to the server.

So in this case the traffic would flow:
From the client to the Internet through UUnet
From the Internet to the server through Cogent

agree?

Now the problem is when the server also has PA blocks

In this case, the destination address selected by the client will
determine the ISP of the server.

Without the shim, the server doesn't have many options; basically what
he could do is use the DNS to prioritize the Cogent addresses. With the
shim, the server can rehome any communication that is using UUnet
addresses over to Cogent and start using Cogent locators. This of course
does not prevent the client from keeping on using the UUnet destination
addresses. In that case, the server can inform the client about his
preferences using a shim protocol option, but even then the client can
prefer something other than what is expressed by S in the preferences.
In any case, in this model, each end can always choose the path used to
send its packets. I guess that in IPv4 it is somewhat different, because
the decision belongs to the intermediate ASes, which are the ones that
select which path to use (note that in that case, it is not S who is in
charge of selecting the incoming path either).

With v4, on the other hand, the router was aware that the client is
reachable via both Cogent and UUnet, and could have had a localpref
configured that would just say "anything reachable over Cogent, use
Cogent". One way to fix that would be to do a shim6 init in the 3-way
handshake, but the problem then becomes that *every* "S" would have to
have a complete routing table and, basically, perform the logic that is
done in today's routers.

Why is that?

I mean, if S prefers Cogent, all he has to do is:
- In the PI case, route its outgoing packets through Cogent and do the
same v4 BGP magic to direct incoming packets through Cogent.
- In the PA case, always use Cogent addresses and try to convince the
clients to use the server's IP address from the Cogent prefix (through
SRV and/or shim preferences).

Obviously, running Zebra with full routes on a server is a non-trivial
performance hit; multiply that out by the number of servers, and it gets
very expensive, very fast. All to regain capabilities we have right now
in IPv4 for free...

Now, of course, the "so called easy" answer would be "let's introduce a
routing policy middleware box that would handle that part". That box would
have the full routing tables, the site policies, and when queried with
"I'm server X, and this is the client and all his locators, which one do I
use?" would spit back an answer to that server that would be a fully
informed decision, and the TE problem becomes mostly solved. I say


But there seem to be two different problems here (at least :-)
- One: what are the TE capabilities available with the PA addressing
model plus the shim tool? This is what can be done in this case.
- Two: who is in control of these capabilities and how are they managed,
i.e. who controls the policy and who manages the devices that enforce
the policy? Is it possible to have centralized policy management? Is it
possible to enforce the usage of the policy (at least within the
multihomed site)?

I guess that before we were considering the first problem, and now the
second one...

This server idea that you are considering was presented by Cedric de Launois in a work called NAROS a while ago

The other option is what we are discussing below about using a DHCP/RAdv option to distribute the policy information among the hosts

The other option is to move to a scheme based on rewriting source prefixes

Or a combination of those

"Mostly", because now there are these pesky issues of: a) do I trust
that the server is going to abide by this decision (either hacked, or a
box outside of my administrative control, yet within the scope of my
network control); b) how do transit ISPs "influence" that decision (at
some point I cross their network, and they should be able to control how
the packets are flowing through their network); c) how do I verify that
their "influencing" doesn't negate mine, and is legitimate; d) how much
"lag" does it introduce into every session establishment, and is that
acceptable; e) can this proxy scale to the number of queries fired at
it, and the real-time computations that would have to happen on each one
(since we can't precompute the answers); and finally, f) is it *really*
more cost-effective than doing all this in routers?

So far, I'd rather pay for bigger routers...

:: I mean in this case, the selection of the server provider is determined by
:: the server's address not by the client address, right?
:: The server can influence such decision using SRV records in the DNS, but not
:: sure yet if this is the case you are considering

See above about difficulties of scaling DNS to meet this goal...


But the problem with the DNS that you have considered above is about
making the DNS publish information that reflects the state of the
links. This indeed seems very difficult, especially because of cached
information and so on. But as far as I know, no one is proposing this.
The idea is to use SRV records to express policy, and not a very dynamic
one. I mean, you can express that, say, 30% of the communications need
to use a given address and the others the other address, and so on, but
the idea is not to allow the DNS to reflect the state of the network.
Actually, it may happen that some of the addresses in the DNS are down.
In this case, the idea is to let the hosts detect this and retry using
alternative addresses. Whether this retry is visible or not to the apps
is still an open issue.
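As a sketch of the kind of static policy SRV records can express:
selection in the RFC 2782 style picks the group with the lowest priority
value and then chooses within it at random, in proportion to weight. The
record tuples and hostnames below are made up for illustration.

```python
import random

def srv_select(records, rng=random):
    """Pick a target from SRV-style (priority, weight, target) records.

    Lowest priority value wins; within that group, a target is chosen
    at random in proportion to its weight (RFC 2782 style), so a 30/70
    weight split steers roughly 30%/70% of new connections.
    """
    best = min(priority for priority, _, _ in records)
    group = [r for r in records if r[0] == best]
    total = sum(weight for _, weight, _ in group)
    if total == 0:
        return rng.choice(group)[2]  # all weights zero: pick any
    point = rng.uniform(0, total)
    running = 0
    for _, weight, target in group:
        running += weight
        if point <= running:
            return target
    return group[-1][2]  # guard against float rounding at the boundary
```

Note that this expresses a fixed traffic split, not link state: the DNS
answer stays the same whether the links are up or down, which is exactly
the "not very dynamic" policy being described.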

:: > This change alone would add millions to the bw bills of said
:: > content providers, and, well, reduce the likelihood of adoption of
:: > the protocol by them. Now, if the shim6 init takes place in the 3way
:: > handshake process, then the servers "somewhat" know what all the
:: > possible paths to reach that locator are, but then they would need
:: > some sort of a policy server telling them who to talk to on what ip,
:: > and that's something which will not simply scale for 100K+ machines.
:: >
::
:: I am not sure i understand the scaling problem here
:: Suppose that you are using a DHCP option for distributing the SHIM6
:: preferences of the RFC3484 policy table. Are you saying that DHCP
:: does not scale for 100K+ machines? or is there something else other
:: than DHCP that

Well, first, show me a content provider who thinks that dhcp scales for
a datacenter (other than initial pxeboot/kickstart/jumpstart, whatever);
but that aside, running zebra/quagga + synchronizing policy updates
among 100K+ machines simply does not scale (operationally).
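For reference, the RFC 3484 policy table mentioned in the quoted text is
just a longest-prefix-match table of (prefix, precedence, label)
entries. A minimal sketch of the lookup, seeded with the RFC 3484
default entries (a DHCP-distributed option would replace or extend these
values):

```python
import ipaddress

# RFC 3484 default policy table: (prefix, precedence, label).
POLICY_TABLE = [
    (ipaddress.ip_network("::1/128"), 50, 0),        # loopback
    (ipaddress.ip_network("::/0"), 40, 1),           # any IPv6
    (ipaddress.ip_network("2002::/16"), 30, 2),      # 6to4
    (ipaddress.ip_network("::/96"), 20, 3),          # IPv4-compatible
    (ipaddress.ip_network("::ffff:0:0/96"), 10, 4),  # IPv4-mapped
]

def policy_lookup(addr):
    """Return (precedence, label) for the longest matching prefix."""
    addr = ipaddress.ip_address(addr)
    matches = [(net, prec, label) for net, prec, label in POLICY_TABLE
               if addr in net]
    net, prec, label = max(matches, key=lambda m: m[0].prefixlen)
    return prec, label
```

The scaling complaint above is not about this lookup (which is trivial
per host) but about keeping 100K+ copies of the table consistent as
policy changes.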


So, you are considering here the case where policy is changed according to the state of the network, right? So that BGP information is used as feedback to the TE decision, is that correct? Is this possible today? how is it done? could you provide an example of how you use this dynamic TE setting?


:: > 4) As has also been discussed before, the initial connect time has
:: > to be *very* low. Anything that takes longer than 4-5 seconds, the
:: > end-users have a funny way of clicking "stop" in their browser,
:: > deeming that "X is down, let me try Y", which is usually not a very
:: > acceptable scenario :-) So, whatever methodology we use to do the
:: > initial set-up has to account for that, and be able to get a
:: > connection that is actually starting to do something in under 2
:: > seconds, along with figuring out which sourceIP and destIP pairs
:: > actually can talk to each other.
::
:: As I mentioned above, we are working on mechanisms other than the
:: shim6 protocol itself that can be used for establishing new
:: communications through outages.
::
:: you can find some work in this area in
::
:: ftp://ftp.rfc-editor.org/in-notes/internet-drafts/draft-bagnulo-ipv6-rfc3484-update-00.txt

It's a fairly good idea for negotiating which SRC and DEST IPs to pick,
but it has to happen *fast* (ie sub 2 seconds), or the end-users will
lose patience and declare the site dead. Perhaps racing SYNs?

Yes, this is an option, and it is nice because you not only get to
detect which address pairs are actually working but also get to pick the
fastest one. But clearly there is the cost of the additional SYNs you
send, which is basically overhead... would you be willing to pay for
these multiple SYNs?
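A rough sketch of what racing SYNs could look like: open connections to
several candidate address pairs in parallel, keep whichever handshake
completes first, and close the rest (the losing SYNs are exactly the
overhead being discussed). The helper below is illustrative, not part of
any shim6 specification.

```python
import socket
from concurrent.futures import ThreadPoolExecutor, as_completed

def race_connect(addr_pairs, timeout=2.0):
    """Race TCP connects to several (host, port) pairs in parallel.

    Returns (socket, addr) for the first handshake that completes;
    connections that lose the race are closed (wasted SYN overhead).
    """
    with ThreadPoolExecutor(max_workers=len(addr_pairs)) as pool:
        futures = {pool.submit(socket.create_connection, a, timeout): a
                   for a in addr_pairs}
        winner = None
        errors = []
        for fut in as_completed(futures):
            try:
                sock = fut.result()
            except OSError as err:
                errors.append(err)
                continue
            if winner is None:
                winner = (sock, futures[fut])  # first to complete wins
            else:
                sock.close()  # loser: pure overhead, discard it
        if winner is None:
            raise ConnectionError("no address pair worked") from (
                errors[0] if errors else None)
        return winner
```

This also answers the detection question in passing: a pair whose SYN is
refused or times out is simply reported as an error, so working and
broken locator pairs are distinguished within the connect timeout.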


Now, I'm not saying that all these problems can't be solved well enough
for people to consider shim6 a viable solution, but so far, they aren't
solved, and until they are, I just don't see recommending that my
employer take shim6 seriously,

I may well agree with you here, but remember that we are still defining
the protocol :-)


I guess the point here is how we can manage to provide a solution that fits the site's requirements, hence your feedback is very valuable

Regards, marcelo


 since it seems like all it's going to do is to move the costs
elsewhere, and quite possibly increase them quite a bit in the process...

-igor