
Re: shim6 @ NANOG (forwarded note from John Payne)



On 24-feb-2006, at 19:47, Jason Schiller (schiller@uu.net) wrote:

I am baffled by the fact that Service Provider Operators have come out
in this forum, at the IAB IPv6 multihoming BOF, and other places, and
have explained how they and their customers use traffic engineering,
yet up until now, shim6 has not tried to provide their needed
functionality.

I think what we have here is a disconnect between what's going on in the wg (and the multi6 design teams) and what's visible from the outside.

I remember MANY conversations, in email and during meetings, about traffic engineering. And for me, there has never been any question that traffic engineering is a must-have for any multihoming solution. Paying for two (or more) links and only being able to use one 99% of the time is simply too cost-ineffective. And just maybe we can convince people that shim6 makes for good multihoming even though it doesn't give you portable address space, but it's never going to fly if the TE is unequivocally worse than what we have today. (And I've said this in the past.)

However, for a number of reasons this isn't all that apparent to an outside observer:

- part of these conversations took place on closed design team lists,
  in private email, or in (design team/interim) meetings (for
  instance, only 3% of the messages in multi6 over the last couple of
  years mention TE)
- I don't think any of us, and certainly not me, saw TE as a
  particularly hard-to-solve problem
- TE can only happen if the base mechanisms are well understood, so we
  were focusing on those first

This is part of the reason more service providers are not involved in
the IETF.

"You have to do what we want or we'll boycot you"? This way, only five people would be active in the IETF...

The other part, as KC Claffy points out, is cost:
http://www.arin.net/meetings/minutes/ARIN_XVI/ppm_minutes_day1.html#anchor_8


I'm not sure which statement about cost you are referring to, or why.

Some history...

1. RFC-3582 attempts to document IPv6 multi-homing requirements.

Forget this RFC, it exists because of the inner workings of the IETF; it doesn't do anything useful in the real world.

2. I tried to document the basic building blocks for TE.
- Primary / backup
- Load all links as best as possible
- Use best path
- Any combination of these basic building blocks
- Additional ability to increase or decrease traffic for any of these

The response I got was: do people actually do this?

What I said was that I didn't understand why people want to have two links and then have the second one sit idle until the first fails. I know people want this because I used to configure this for customers when I worked at UUNET NL. But my thinking is that if you have multiple links, you'll want to use all of them.

3. IAB IPv6 multi-homing BOF

It seems to me that Service Provider Operators made a very clear statement
at the BOF.
-Traffic engineering is needed day 1.

I agree with that one.

  * Traffic engineering should not be an end host decision, but an
    end site (network level) decision [managing on the end host is
    the wrong place]

If hosts can do congestion control they can do traffic engineering. The only question is how to get site-wide policies into hosts.

  * Traffic engineering needs to support in-bound and out-bound
    traffic management

Sure.

  * Traffic engineering needs to be allowed by transit ASes as well
    as end site ASes [don't leave all ISP TE in the hands of our
    customers]

Are you saying that if I have two ISPs, those get to decide how I balance my traffic over them? What if they turn this knob in opposite directions?

Although I think it's useful for networks in the middle to be able to express some pushback, I'm not sure if this is implementable for sites that don't have a full BGP feed, and if it turns out this is impossible or too hard to implement, I don't think that's a fatal flaw. You don't get to push back on single homed customers either.

-First hit is critical
  * establishing shim6 after the session starts doesn't help
    short lived sessions

I'm not sure where this comes from. Since shim6 doesn't come into play until there is a failure, and failures are too rare to be meaningful in TE, the shim6 failover protocol itself is fairly meaningless for TE. What we need are mechanisms to do source/destination address selection in a way that can be traffic engineered. The length of individual sessions is irrelevant, as shim6 doesn't work per-session. Most short sessions are part of a longer-lived interaction (e.g., a user visiting a WWW server and retrieving dozens or hundreds of resources over the course of a dozen seconds to many minutes).

  * Keeping shim6 state on the end host doesn't scale for content
providers. A single server may have 30,000 concurrent TCP sessions

Right. So there is precedent for storing state for 30,000 instances of "something". Servers are getting a lot faster and memory is getting cheaper, so adding a modest amount of extra state for longer-lived associations shouldn't be problematic.

(Visit a run of the mill content provider and see how many 100 byte GIFs they send you over HTTP connections that have 700 - 1200 byte overhead and of course all have high performance extensions turned on for extra bandwidth wastage.)

-Maybe 8+8 / GSE seems to be a better starting point to support transit AS
 TE and to avoid the first hit problem and still allow for an "easy"
 multi-homing for consumer customers ?

8+8/GSE won't work: it doesn't tell us how to do failover, it requires changes to TCP and other upper layer protocols, and the locator-identifier binding is insecure. On the surface, it may seem that TCP/IP as we know it today is insecure to begin with, so the GSE/8+8 insecurity doesn't add new holes. Unfortunately, it does. With IP as it is today, when I want to pretend that I'm www.yahoo.com at the IP level, I have to send out packets with a source address that matches www.yahoo.com (which is generally easy) but I also have to make sure that packets toward that address get back to me. On an insecure (wireless) LAN this is easy, but once the packet ends up at an ISP network, this isn't easy to do, and almost impossible to hide. With 8+8 on the other hand, I can just create a packet that has the Yahoo identifier, and my locator. This way, I can very easily get my victim to talk to me while thinking he is talking to Yahoo.

Funny thing: you can look at shim6 as a next generation of GSE/8+8 (16+16) that removes the problems listed above.

The response sounds to me like the shim6 wg is finally interested in
considering decent TE as a "requirement". Yay! But I am concerned about
what Operators and IETF folk think is "decent TE",

Let me speak for myself and speculate a bit: what we should do is have multihomed sites publish SRV (or very similar) records with two values: a "strong" value that allows primary/backup mechanisms, and a "weak" value that allows things like "60% of all sessions should go to this address and 40% to that one".

Then, before a host sets up a session it consults a local policy server that adds local preferences to the remote ones and also supplies the appropriate source address that goes with each destination address. New mechanisms to distribute this information have been proposed in the past, but there is already a service that is consulted before the start of most sessions, so it makes sense to reuse that service. (No prizes for guessing what service I'm getting at.)

This would allow for pretty fine tuned incoming TE, as long as the other end doesn't have a reason to override the receiving site's preferences.
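To make that concrete, here's a rough Python sketch of how a host (or its policy server) might combine the remote "strong"/"weak" values with local policy. The record format, the addresses, and the policy-server output are all invented for illustration; nothing here is a worked-out protocol.

```python
import random

# Hypothetical record format, modeled on the SRV-like proposal above:
# each remote address carries a "strong" priority (lower wins outright,
# giving primary/backup) and a "weak" weight (proportional load sharing).
REMOTE = [
    # (destination address, strong priority, weak weight)
    ("2001:db8:1::80", 10, 60),   # via ISP A, 60% of sessions
    ("2001:db8:2::80", 10, 40),   # via ISP B, 40% of sessions
    ("2001:db8:3::80", 20, 100),  # backup, only if priority 10 is dead
]

# Assumed local policy server output: a multiplier per destination
# prefix plus the source address that must be paired with it.
LOCAL_POLICY = {
    "2001:db8:1::/48": (1.0, "2001:db8:a::1"),
    "2001:db8:2::/48": (0.5, "2001:db8:b::1"),  # halve ISP B locally
    "2001:db8:3::/48": (1.0, "2001:db8:a::1"),
}

def prefix_of(addr):
    # Toy /48 match on the textual form; real code would use ipaddress.
    return ":".join(addr.split(":")[:3]) + "::/48"

def pick_destination(records, policy):
    """Honor the lowest 'strong' priority, then choose among those
    candidates with 'weak' weights scaled by local policy. Returns
    the chosen destination and the source address to pair with it."""
    best = min(prio for _addr, prio, _w in records)
    candidates = [(a, w) for a, prio, w in records if prio == best]
    scaled = [(a, w * policy[prefix_of(a)][0]) for a, w in candidates]
    r = random.uniform(0, sum(w for _a, w in scaled))
    for addr, w in scaled:
        r -= w
        if r <= 0:
            break
    return addr, policy[prefix_of(addr)][1]
```

Note how the backup address never gets picked while a lower-priority candidate is available, and how local policy shifts the 60/40 split without the remote site knowing.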

I also imagine some use of measured and synthetic round trips to select the "fast" path where possible. This can't be done in BGP: BGP is pretty good at avoiding very bad paths, but it's not so good at selecting the best ones.
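To illustrate, a minimal host-side sketch of such a measurement: time a TCP handshake to each candidate destination address and order the candidates by the result. A real implementation would probe every source/destination pair and smooth over multiple samples; the injectable `rtt` argument only exists to make the function testable without a network.

```python
import socket
import time

def measure_rtt(dst, port=80, timeout=1.0):
    """Crude synthetic probe: time a TCP handshake to dst.
    Returns infinity for unreachable addresses so they sort last."""
    start = time.monotonic()
    try:
        with socket.create_connection((dst, port), timeout=timeout):
            pass
    except OSError:
        return float("inf")
    return time.monotonic() - start

def fastest_path(candidates, rtt=measure_rtt):
    """Order candidate destination addresses by measured RTT, best first."""
    return sorted(candidates, key=rtt)
```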

Yuck, you should never announce more specifics for this.

Please believe the DFZ Service Providers when they explain how they, and
their customers, do TE.

I believe that they do it, because I can see that the global routing table increased by 16% last year. I have to admit that I've done this myself from time to time, but only if AS path prepending (or changing the origin attribute) wouldn't result in something reasonable. It seems to me that for many people deaggregating is the default these days. And then not just breaking a /20 into two /21s, but going for broke and announcing 16 /24s, who cares?

Take the picture below where cust1 has connectivity to UUNET and
at&t. cust2 has connectivity to Sprint and L(3). UUNET, at&t, Sprint,
and L(3) all peer with each other.

       UUNET---Sprint
      / |   \  /   | \
     /  |    \/    |  \
cust1   |    /\    |   cust2
     \  |   /  \   |  /
      \ |  /    \  | /
       at&t------L(3)

-cust1 pays a flat rate to at&t and per packet to UUNET.
-cust1 prefers to use the at&t link as primary (in and out bound)
-cust1 sends BGP community 701:80 to UUNET, and UUNET sets a local pref of
 80 on behalf of the customer

-cust2 has more out bound than in bound traffic.
-cust2 wants to load share all out bound traffic across both links
-cust2 wants traffic delivered to it over the "best" path

Traffic from cust1 to cust2
---------------------------
1. cust1 will send the traffic to at&t
2. at&t will decide if it is better to deliver traffic to cust2
   via the exit point to L(3) or via the exit point to Sprint
3A. If at&t thinks the Sprint exit is preferred, then
    Sprint should deliver traffic to its customer over the
    Sprint-cust2 link
3B. If at&t thinks the L(3) exit is preferred, then
    L(3) should deliver traffic to its customer over the
    L(3)-cust2 link

*In this case at&t can do some TE.  Sprint may actually be
 closer or further than L(3), or at&t may artificially
 distance or shorten Sprint, or may force certain prefixes
 to prefer Sprint or L(3) [this is usually only the case for
 purchased transit and not peering]

So far so good. Note that with shim6, it's possible (although probably hard to do in practice) for cust1 to use four different paths: UUNET->Sprint and at&t->L(3), but also UUNET->L(3) and at&t->Sprint. So in the presence of congestion or scenic routing, there is a much better chance for the customer to utilize the optimal path.

This is both good and bad for ISPs/carriers: the customer experience improves, but customers will more actively avoid "bad" paths, so carriers can't get away with those as much as they can now.

Traffic from cust2 to cust1
---------------------------
1. cust2 will spray traffic to Sprint and at&t
2A. UUNET is not advertising cust1 routes to Peers as
    the best path is learned from a Peer and UUNET does
    not provide transit to Peers.
3A. L(3) and Sprint will forward traffic to at&t
4A. at&t will forward traffic to their customer over the
    at&t-cust1 link

2B. at&t is a customer of UUNET instead of a Peer.
    In this case UUNET will advertise the cust1
    prefix to L(3) and Sprint.
3B. L(3) and Sprint will choose the best exit and
    send the traffic either to at&t or to UUNET
4B. Traffic sent to UUNET will be delivered to at&t as
    UUNET will honor the customer's low local pref community
    Traffic sent to at&t (either from UUNET or L(3) or Sprint)
    will be delivered over the at&t-cust1 link.

With shim6 and the TE I outlined earlier a correspondent would be able to override the receiving site's wishes, which isn't possible in the above scenario. However, it's unlikely that correspondents will do this on a wide scale unless there is some reason why this is beneficial to them.
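For what it's worth, the 701:80 mechanism in the quoted scenario is easy to model. Here's a toy Python sketch of an import policy plus the first two best-path tie-breakers; the community-to-preference mapping, the AS numbers, and the default of 100 are illustrative, not UUNET's actual configuration.

```python
DEFAULT_LOCAL_PREF = 100            # assumed default, not real config
COMMUNITY_TO_PREF = {"701:80": 80}  # "customer lowers own preference"

def apply_import_policy(route):
    """Set local-pref from a recognized community, else the default."""
    for community in route.get("communities", []):
        if community in COMMUNITY_TO_PREF:
            route["local_pref"] = COMMUNITY_TO_PREF[community]
            return route
    route["local_pref"] = DEFAULT_LOCAL_PREF
    return route

def best_path(routes):
    """The first two BGP tie-breakers only: highest local preference,
    then shortest AS path. Real BGP has many more steps after these."""
    return max(routes, key=lambda r: (r["local_pref"], -len(r["as_path"])))
```

With the community applied, the route learned directly from cust1 loses to the one heard over the at&t session, which is exactly the "traffic sent to UUNET will be delivered to at&t" behavior described above.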

In shim6 if cust1 chooses the Sprint IP address as the destination
then all transit ASes must deliver the traffic via Sprint.  Transit
ASes have no way to know that the destination lives behind both
Sprint and L(3), and therefore cannot deliver the traffic to L(3)
even if the L(3) exit point is better.

If shim6 sites have access to a BGP feed they can still do outgoing traffic engineering as usual. However, I expect that only a subset of all shim6 sites will bother to run BGP, so many will have to depend on end-to-end information, which will often be better than what BGP supplies and sometimes (a lot) worse, but never as easy for ASes in the middle to change.

Transit AS TE is more critical in the case of a moderately sized
transit AS that is purchasing transit from multiple upstreams,
especially when links are cost prohibitive. Take a large South
American ISP that has 16 STM-1s, where 4xSTM1 use the Americas 2
oceanic cable system to upstream transit provider1, 4xSTM1 use the
Emergia oceanic cable system to upstream transit provider1, 4xSTM1
use the Americas 2 oceanic cable system to upstream transit
provider2, and 4xSTM1 use the Emergia oceanic cable system to
upstream transit provider2. Now imagine that your most important
customer, who always complains about latency, should always use the
Americas 2 oceanic cable system to upstream transit provider1. Also
imagine all other traffic should load all the other links as equally
as possible, and if any one or more links fail, all the links should
be loaded as equally as possible. Note: This is just one example of
a real world customer.

Unfortunately this is incompatible with hop-by-hop forwarding for outgoing traffic from the customer. Obviously this can be solved both today and with shim6 using MPLS or similar.
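To be clear about what the requirement itself looks like, separate from how the packets would actually be steered, here is a sketch of the allocation logic with invented link names: pin the latency-sensitive customer to the Americas 2 links toward provider1 and spread everything else evenly over whatever is still up.

```python
# Invented link names for the 16 STM-1s in the example above:
# two cable systems x two upstream providers x four links each.
LINKS = [f"{cable}-prov{p}-{i}"
         for cable in ("americas2", "emergia")
         for p in (1, 2)
         for i in range(1, 5)]

def allocate(links_up, pinned_traffic, other_traffic):
    """Pin the latency-sensitive customer's traffic to the Americas 2 /
    provider1 links; spread everything else equally over the remaining
    links that are still up, rebalancing automatically on failure."""
    pinned = [l for l in links_up if l.startswith("americas2-prov1")]
    others = [l for l in links_up if l not in pinned] or links_up
    pinned = pinned or links_up  # last resort: don't blackhole the customer
    load = {l: 0.0 for l in links_up}
    for l in pinned:
        load[l] += pinned_traffic / len(pinned)
    for l in others:
        load[l] += other_traffic / len(others)
    return load
```

The sketch only expresses the desired loading; as noted, nothing in hop-by-hop destination-based forwarding lets the customer enforce this split for its outgoing traffic without something like MPLS underneath.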