
Re: shim6 @ NANOG (forwarded note from John Payne)



On 24-feb-2006, at 19:47, Jason Schiller (schiller@uu.net) wrote:

I am baffled by the fact that Service Provider Operators have come out
in this forum, at the IAB IPv6 multihoming BOF, and other places, and
have explained how they and their customers use traffic engineering,
yet up until now, shim6 has not tried to provide their needed
functionality.

I think what we have here is a disconnect between what's going on in the wg (and the multi6 design teams) and what's visible from the outside.

I remember MANY conversations, in email and during meetings, about traffic engineering. And for me, there has never been any question that traffic engineering is a must-have for any multihoming solution. Paying for two (or more) links and only being able to use one 99% of the time is simply too cost-ineffective. And just maybe we can convince people that shim6 makes for good multihoming even though it doesn't give you portable address space, but it's never going to fly if the TE is unequivocally worse than what we have today. (And I've said this in the past.)

However, for a number of reasons this isn't all that apparent to an outside observer:

- part of these conversations took place on closed design team lists,
  in private email, or in (design team/interim) meetings (for
  instance, only 3% of the messages in multi6 over the last couple of
  years mention TE)
- I don't think any of us, and certainly not me, saw TE as a
  particularly hard-to-solve problem
- TE can only happen if the base mechanisms are well understood, so we
  were focusing on those first

This is part of the reason more service providers are not involved in
the IETF.

"You have to do what we want or we'll boycot you"? This way, only five people would be active in the IETF...

The other part, as KC Claffy points out, is cost:
http://www.arin.net/meetings/minutes/ARIN_XVI/ppm_minutes_day1.html#anchor_8


I'm not sure which statement about cost you are referring to, or why.

Some history...

1. RFC-3582 attempts to document IPv6 multi-homing requirements.

Forget this RFC, it exists because of the inner workings of the IETF; it doesn't do anything useful in the real world.

2. I tried to document the basic building blocks for TE.
- Primary / backup
- Load all links as best as possible
- Use best path
- Any combination of these basic building blocks
- Additional ability to increase or decrease traffic for any of these

The response I got was: do people actually do this?

What I said was that I didn't understand why people want to have two links and then have the second one sit idle until the first fails. I know people want this because I used to configure this for customers when I worked at UUNET NL. But my thinking is that if you have multiple links, you'll want to use all of them.

3. IAB IPv6 multi-homing BOF

It seems to me that Service Provider Operators made a very clear statement
at the BOF.
-Traffic engineering is needed day 1.

I agree with that one.

  * Traffic engineering should not be an end host decision, but an
    end site (network level) decision [managing on the end host is
    the wrong place]

If hosts can do congestion control they can do traffic engineering. The only question is how to get site-wide policies into hosts.

  * Traffic engineering needs to support in-bound and out-bound
    traffic management

Sure.

  * Traffic engineering needs to be allowed by transit ASes as well
    as end site ASes [don't leave all ISP TE in the hands of our
    customers]

Are you saying that if I have two ISPs, those get to decide how I balance my traffic over them? What if they turn this knob in opposite directions?

Although I think it's useful for networks in the middle to be able to express some pushback, I'm not sure if this is implementable for sites that don't have a full BGP feed, and if it turns out this is impossible or too hard to implement, I don't think that's a fatal flaw. You don't get to push back on single homed customers either.

-First hit is critical
  * establishing shim6 after the session starts doesn't help
    short lived sessions

I'm not sure where this comes from. Since shim6 doesn't come into play until there is a failure, and failures are too rare to be meaningful in TE, the shim6 failover protocol itself is fairly meaningless for TE. What we need are mechanisms to do source/destination address selection in a way that can be traffic engineered. The length of individual sessions is irrelevant, as shim6 doesn't work per-session. Most short sessions are part of a longer-lived interaction (e.g., a user visiting a WWW server and retrieving dozens or hundreds of resources over the course of a dozen seconds to many minutes).

  * Keeping shim6 state on the end host doesn't scale for content
providers. A single server may have 30,000 concurrent TCP sessions

Right. So there is precedent for storing state for 30,000 instances of "something". Servers are getting a lot faster and memory is getting cheaper, so adding a modest amount of extra state for longer-lived associations shouldn't be problematic.

(Visit a run of the mill content provider and see how many 100 byte GIFs they send you over HTTP connections that have 700 - 1200 byte overhead and of course all have high performance extensions turned on for extra bandwidth wastage.)

-Maybe 8+8 / GSE seems to be a better starting point to support transit AS
 TE and to avoid the first hit problem and still allow for an "easy"
 multi-homing for consumer customers ?

8+8/GSE won't work: it doesn't tell us how to do failover, it requires changes to TCP and other upper layer protocols, and the locator-identifier binding is insecure. On the surface, it may seem that TCP/IP as we know it today is insecure to begin with, so the GSE/8+8 insecurity doesn't add new holes. Unfortunately, it does. With IP as it is today, when I want to pretend that I'm www.yahoo.com at the IP level, I have to send out packets with a source address that matches www.yahoo.com (which is generally easy) but I also have to make sure that packets toward that address get back to me. On an insecure (wireless) LAN this is easy, but once the packet ends up at an ISP network, this isn't easy to do, and almost impossible to hide. With 8+8 on the other hand, I can just create a packet that has the Yahoo identifier, and my locator. This way, I can very easily get my victim to talk to me while thinking he is talking to Yahoo.

Funny thing: you can look at shim6 as a next generation of GSE/8+8 (16+16) that removes the problems listed above.

The response sounds to me like the shim6 wg is finally interested in
considering decent TE as a "requirement". Yay! But I am concerned about
what Operators and IETF folk think is "decent TE",

Let me speak for myself and speculate a bit: what we should do is have multihomed sites publish SRV (or very similar) records with two values: a "strong" value that allows primary/backup mechanisms, and a "weak" value that allows things like "60% of all sessions should go to this address and 40% to that one".

Then, before a host sets up a session it consults a local policy server that adds local preferences to the remote ones and also supplies the appropriate source address that goes with each destination address. New mechanisms to distribute this information have been proposed in the past, but there is already a service that is consulted before the start of most sessions, so it makes sense to reuse that service. (No prizes for guessing what service I'm getting at.)

This would allow for pretty fine tuned incoming TE, as long as the other end doesn't have a reason to override the receiving site's preferences.
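To make that concrete, here's a rough Python sketch of how a host (or its policy server) might combine the remote "strong"/"weak" values with local policy. The record format, the addresses, and the policy-server output are all invented for illustration; nothing here is a worked-out protocol.

```python
import random

# Hypothetical record format, modeled on the SRV-like proposal above:
# each remote address carries a "strong" priority (lower wins outright,
# giving primary/backup) and a "weak" weight (proportional load sharing).
REMOTE = [
    # (destination address, strong priority, weak weight)
    ("2001:db8:1::80", 10, 60),   # via ISP A, 60% of sessions
    ("2001:db8:2::80", 10, 40),   # via ISP B, 40% of sessions
    ("2001:db8:3::80", 20, 100),  # backup, only if priority 10 is dead
]

# Assumed local policy server output: a multiplier per destination
# prefix plus the source address that must be paired with it.
LOCAL_POLICY = {
    "2001:db8:1::/48": (1.0, "2001:db8:a::1"),
    "2001:db8:2::/48": (0.5, "2001:db8:b::1"),  # halve ISP B locally
    "2001:db8:3::/48": (1.0, "2001:db8:a::1"),
}

def prefix_of(addr):
    # Toy /48 match on the textual form; real code would use ipaddress.
    return ":".join(addr.split(":")[:3]) + "::/48"

def pick_destination(records, policy):
    """Honor the lowest 'strong' priority, then choose among those
    candidates with 'weak' weights scaled by local policy. Returns
    the chosen destination and the source address to pair with it."""
    best = min(prio for _addr, prio, _w in records)
    candidates = [(a, w) for a, prio, w in records if prio == best]
    scaled = [(a, w * policy[prefix_of(a)][0]) for a, w in candidates]
    r = random.uniform(0, sum(w for _a, w in scaled))
    for addr, w in scaled:
        r -= w
        if r <= 0:
            break
    return addr, policy[prefix_of(addr)][1]
```

Note how the backup address never gets picked while a lower-priority candidate is available, and how local policy shifts the 60/40 split without the remote site knowing.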

I also imagine some use of measured and synthetic round trips to select the "fast" path where possible. This can't be done in BGP: BGP is pretty good at avoiding very bad paths, but it's not so good at selecting the best ones.
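To illustrate, a minimal host-side sketch of such a measurement: time a TCP handshake to each candidate destination address and order the candidates by the result. A real implementation would probe every source/destination pair and smooth over multiple samples; the injectable `rtt` argument only exists to make the function testable without a network.

```python
import socket
import time

def measure_rtt(dst, port=80, timeout=1.0):
    """Crude synthetic probe: time a TCP handshake to dst.
    Returns infinity for unreachable addresses so they sort last."""
    start = time.monotonic()
    try:
        with socket.create_connection((dst, port), timeout=timeout):
            pass
    except OSError:
        return float("inf")
    return time.monotonic() - start

def fastest_path(candidates, rtt=measure_rtt):
    """Order candidate destination addresses by measured RTT, best first."""
    return sorted(candidates, key=rtt)
```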

Yuck, you should never announce more specifics for this.

Please believe the DFZ Service Providers when they explain how they, and
their customers, do TE.

I believe that they do it, because I can see that the global routing table increased by 16% last year. I have to admit that I've done this myself from time to time, but only if AS path prepending (or changing the origin attribute) wouldn't result in something reasonable. It seems to me that for many people deaggregating is the default these days. And then not just breaking a /20 into two /21s, but going for broke and announcing 16 /24s, who cares?

Take the picture below where cust1 has connectivity to UUNET and
at&t. cust2 has connectivity to Sprint and L(3). UUNET, at&t, Sprint,
and L(3) all peer with each other.

       UUNET---Sprint
      / |   \  /   | \
     /  |    \/    |  \
cust1   |    /\    |   cust2
     \  |   /  \   |  /
      \ |  /    \  | /
       at&t------L(3)

-cust1 pays a flat rate to at&t and per packet to UUNET.
-cust1 prefers to use the at&t link as primary (in and out bound)
-cust1 sends BGP community 701:80 to UUNET, and UUNET sets a local pref of
 80 on behalf of the customer

-cust2 has more out bound than in bound traffic.
-cust2 wants to load share all out bound traffic across both links
-cust2 wants traffic delivered to it over the "best" path

Traffic from cust1 to cust2
---------------------------
1. cust1 will send the traffic to at&t
2. at&t will decide if it is better to deliver traffic to cust2
   via the exit point to L(3) or via the exit point to Sprint
3A. If at&t thinks the Sprint exit is preferred, then
    Sprint should deliver traffic to its customer over the
    Sprint-cust2 link
3B. If at&t thinks the L(3) exit is preferred, then
    L(3) should deliver traffic to its customer over the
    L(3)-cust2 link

*In this case at&t can do some TE.  Sprint may actually be
 closer or further than L(3), or at&t may artificially
 distance or shorten Sprint, or may force certain prefixes
 to prefer Sprint or L(3) [this is usually only the case for
 purchased transit and not peering]

So far so good. Note that with shim6, it's possible (although probably hard to do in practice) for cust1 to use four different paths: UUNET->Sprint and at&t->L(3), but also UUNET->L(3) and at&t->Sprint. So in the presence of congestion or scenic routing, there is a much better chance for the customer to utilize the optimal path.

This is both good and bad for ISPs/carriers: the customer experience improves, but customers will more actively avoid "bad" paths, so carriers can't get away with those as much as they can now.

Traffic from cust2 to cust1
---------------------------
1. cust2 will spray traffic to Sprint and at&t
2A. UUNET is not advertising cust1 routes to Peers as
    the best path is learned from a Peer and UUNET does
    not provide transit to Peers.
3A. L(3) and Sprint will forward traffic to at&t
4A. at&t will forward traffic to their customer over the
    at&t-cust1 link

2B. at&t is a customer of UUNET instead of a Peer.
    In this case UUNET will advertise the cust1
    prefix to L(3) and Sprint.
3B. L(3) and Sprint will choose the best exit and
    send the traffic either to at&t or to UUNET
4B. Traffic sent to UUNET will be delivered to at&t as
    UUNET will honor the customer's low local pref community
    Traffic sent to at&t (either from UUNET or L(3) or Sprint)
    will be delivered over the at&t-cust1 link.

With shim6 and the TE I outlined earlier a correspondent would be able to override the receiving site's wishes, which isn't possible in the above scenario. However, it's unlikely that correspondents will do this on a wide scale unless there is some reason why this is beneficial to them.
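For what it's worth, the 701:80 mechanism in the quoted scenario is easy to model. Here's a toy Python sketch of an import policy plus the first two best-path tie-breakers; the community-to-preference mapping, the AS numbers, and the default of 100 are illustrative, not UUNET's actual configuration.

```python
DEFAULT_LOCAL_PREF = 100            # assumed default, not real config
COMMUNITY_TO_PREF = {"701:80": 80}  # "customer lowers own preference"

def apply_import_policy(route):
    """Set local-pref from a recognized community, else the default."""
    for community in route.get("communities", []):
        if community in COMMUNITY_TO_PREF:
            route["local_pref"] = COMMUNITY_TO_PREF[community]
            return route
    route["local_pref"] = DEFAULT_LOCAL_PREF
    return route

def best_path(routes):
    """The first two BGP tie-breakers only: highest local preference,
    then shortest AS path. Real BGP has many more steps after these."""
    return max(routes, key=lambda r: (r["local_pref"], -len(r["as_path"])))
```

With the community applied, the route learned directly from cust1 loses to the one heard over the at&t session, which is exactly the "traffic sent to UUNET will be delivered to at&t" behavior described above.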

In shim6 if cust1 chooses the Sprint IP address as the destination
then all transit ASes must deliver the traffic via Sprint.  Transit
ASes have no way to know that the destination lives behind both
Sprint and L(3), and therefore cannot deliver the traffic to L(3)
even if the L(3) exit point is better.

If shim6 sites have access to a BGP feed they can still do outgoing traffic engineering as usual. However, I expect that only a subset of all shim6 sites will bother to run BGP, so many will have to depend on end-to-end information, which will often be better than what BGP supplies and sometimes (a lot) worse, but never as easy for ASes in the middle to change.

Transit AS TE is more critical in the case of a moderately sized
transit AS that is purchasing transit from multiple upstreams,
especially when links are cost prohibitive. Take a large South
American ISP that has 16 STM-1s, where 4xSTM1 use the Americas 2
oceanic cable system to upstream transit provider1, 4xSTM1 use the
Emergia oceanic cable system to upstream transit provider1, 4xSTM1
use the Americas 2 oceanic cable system to upstream transit
provider2, and 4xSTM1 use the Emergia oceanic cable system to
upstream transit provider2. Now imagine that your most important
customer, who always complains about latency, should always use the
Americas 2 oceanic cable system to upstream transit provider1. Also
imagine all other traffic should load all the other links as equally
as possible, and if any one or more links fail, all the links should
be loaded as equally as possible. Note: This is just one example of
a real world customer.

Unfortunately this is incompatible with hop-by-hop forwarding for outgoing traffic from the customer. Obviously this can be solved both today and with shim6 using MPLS or similar.
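To be clear about what the requirement itself looks like, separate from how the packets would actually be steered, here is a sketch of the allocation logic with invented link names: pin the latency-sensitive customer to the Americas 2 links toward provider1 and spread everything else evenly over whatever is still up.

```python
# Invented link names for the 16 STM-1s in the example above:
# two cable systems x two upstream providers x four links each.
LINKS = [f"{cable}-prov{p}-{i}"
         for cable in ("americas2", "emergia")
         for p in (1, 2)
         for i in range(1, 5)]

def allocate(links_up, pinned_traffic, other_traffic):
    """Pin the latency-sensitive customer's traffic to the Americas 2 /
    provider1 links; spread everything else equally over the remaining
    links that are still up, rebalancing automatically on failure."""
    pinned = [l for l in links_up if l.startswith("americas2-prov1")]
    others = [l for l in links_up if l not in pinned] or links_up
    pinned = pinned or links_up  # last resort: don't blackhole the customer
    load = {l: 0.0 for l in links_up}
    for l in pinned:
        load[l] += pinned_traffic / len(pinned)
    for l in others:
        load[l] += other_traffic / len(others)
    return load
```

The sketch only expresses the desired loading; as noted, nothing in hop-by-hop destination-based forwarding lets the customer enforce this split for its outgoing traffic without something like MPLS underneath.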