
Re: [RRG] Perplexing PMTUD and packet length observations



Hi Iljitsch,

Before I reply to what you wrote, on the matter of my server sending
out jumboframe packets, with DF=1, and them being received correctly
by clients on DSL services:

I was able to reproduce the behavior and have documented it clearly
here:

http://www.firstpr.com.au/ip/ivip/ipv4-bits/actual-packets.html#2008-08-12

I just updated to a recent kernel, 2.6.18-92.1.10.el5, and the
behavior continues.  (This is CentOS 5.1, so maybe we are seeing
something Red Hat-specific?)

My guess is that the kernel is bundling the data of multiple
full-MSS-length TCP packets produced by httpd and emitting them as
single packets two to six times as long.  I haven't yet found out
what aspect of the kernel (or httpd?) is doing this.  To me, it is
very surprising behavior - and it violates the TCP protocol by
sending packets far longer than the MSS limit.
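One way to test this coalescing guess against captured packet
lengths - a rough sketch, where the 1460-byte MSS and 40 bytes of
IP+TCP headers are assumptions about my particular link, and
segments_in is just an illustrative name:

```python
def segments_in(packet_len, mss=1460, headers=40):
    """If the kernel coalesced n full-MSS segments into one IP
    packet, its total length should be n*mss + headers.
    Return n if the length fits that pattern, else None."""
    payload = packet_len - headers
    if payload > 0 and payload % mss == 0:
        return payload // mss
    return None

print(segments_in(2960))   # 2 - two 1460-byte segments plus headers
print(segments_in(1500))   # 1 - an ordinary full-MSS packet
print(segments_in(1400))   # None - not a whole number of segments
```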

You mentioned some systems sending longer-than-MSS TCP packets to
hosts on the local network.  Can you recall what this feature might
be called?

These jumbo frames presumably reach the router at the ISP end of the
PPPoE DSL link, and that router does something I found surprising,
and again at odds with my understanding of what is righteous and
proper: it splits up the data from this one long TCP packet and
sends it (with sequentially numbered Identification fields) as
separate TCP packets to the destination.  These are not fragments -
they are standalone TCP packets.

This works fine for TCP - but would not for UDP, or for TCP with an
IPsec Authentication Header.  I don't like my own server and DSL
link doing things I don't understand and can't find any
documentation for.


You wrote:

>> I expected that most hosts would send packets with DF=1, using
>> RFC 1191 PMTUD, since it has been around since 1990 . . .
>
> Most hosts do. It's enabled by default on Win, Mac, Linux and
> FreeBSD. Probably more.

Yes.

>> and that it would be considered antisocial to send any longish
>> packets into the Net with DF=0
>
> Why?
>
>> saying the Network should do extra work if the packets are too
>> long for it.

Because RFC 1191 has been around since 1990 and provides a civilised
way of sending large packets - without expecting some router in the
path to fragment packets, and the rest of the routers to have to
deal with twice the number of packets.


> Well, it's better than setting DF and ignoring the ICMP too bigs,
> like some people do (see recent thread on NANOG).

That would be antisocial and stupid!

I couldn't easily find the thread at:

  http://www.merit.edu/mail.archives/nanog/

>> In fact, some servers - including at least some Google servers,
>> send only DF=0 packets, and therefore never do any PMTUD.  They
>> won't be tempted by the client having a higher MSS than their own
>> 1430, but as long as the client's MSS is this or above, the
>> server frequently sends 1470 byte packets with DF=0.
>
> Hm, didn't know that, but yes, that's what I'm seeing too.

OK - my page has test cases.

http://www.firstpr.com.au/ip/ivip/ipv4-bits/actual-packets.html#google-no-pmtud

>>  1 - I guess Google figure 1470 is a packet size which generally
>>      works in the core.
>
> Or they use a 30 byte encapsulation somewhere.

Maybe so, but I guess they are not so irresponsible as to be
systematically sending out DF=0 packets which would cause
widespread fragmentation.


>>  2 - I guess they figure that if the client advertises an MSS and
>>      when 40 (IP and TCP headers) is added to that figure if the
>>      result is larger than any MTU limit between the client and
>>      Google's servers, then it is the client's problem.
>
> ??? Why would advertising a large MSS be a problem? You send what
> the other advertises he/she can handle and obviously _they_ will
> be sending you what they can handle.

Yes, but what if, for some reason, there is a router in the path
with a smaller MTU than is generally seen by the client or by Google?
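To put numbers on that worry - a sketch of the arithmetic, where
the 1400-byte mid-path MTU is purely hypothetical:

```python
def mss_for_mtu(mtu, ip_header=20, tcp_header=20):
    """IPv4 TCP MSS implied by a link MTU: the MTU minus the
    IP and TCP headers."""
    return mtu - ip_header - tcp_header

client_mss = mss_for_mtu(1500)   # 1460 for an ordinary Ethernet host
path_mtu = 1400                  # hypothetical small-MTU router in the path

# The server honours the client's advertised MSS, but a full-size
# segment still exceeds what the mid-path router can carry:
print(client_mss > mss_for_mtu(path_mtu))   # True - only PMTUD (or
                                            # fragmentation) saves it
```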

>> If lots of hosts are sending long packets with DF=0, we need to
>> cope with them in any map-encap scheme.
>
> Like I've been saying for a long time, reducing the user-visible
> MTU of the internet is not an acceptable approach.

I have figured out how to do ITR - ETR packet delivery without
encapsulation or address rewriting for IPv6:

  http://www.firstpr.com.au/ip/ivip/ivip6/

and yesterday figured out a somewhat similar scheme for IPv4.  I
will write them up as Internet Drafts ASAP.  They both involve
changes to the use of header bits, and fairly minimal modifications
to core routers.  That makes these schemes much harder to deploy,
but there would be no PMTUD problems or lost bandwidth due to
encapsulation - both enormous benefits - and much less complexity
in ITRs and ETRs.

> BTW: if you haven't seen it before, have a look at this before it
> expires two weeks from now:
> http://www.ietf.org/internet-drafts/draft-van-beijnum-multi-mtu-02.txt

OK.  I was intrigued by the list of common MTU values you give.
Googling reveals the source:

http://www.ietf.org/mail-archive/web/ram/current/msg01854.html

in which you wrote:

> I've looked over some product literature and compiled the
> following list: 1508, 1530, 1536, 1546, 1998, 2000, 2018, 4464,
> 4470, 8092, 8192, 9000, 9176, 9180, 9216, 17976, 64000 and 65280.

I guess it makes sense to have mechanisms by which a host can
discover some MTU values for nearby routers and other hosts.

I am trying to do something different - to find methods of
delivering packets for an Edge-Core Separation solution to the
routing scaling problem which don't upset conventional RFC 1191
PMTUD.


> ... ICMP too big messages are ESSENTIAL and MUST NOT be
> filtered! Any other message from the IETF is unacceptable.

I agree absolutely.

I think there is no workable alternative to RFC 1191 PMTUD.

RFC 4821 is very difficult to implement, since it involves the
packetisation layers of all UDP applications talking back and forth
with the operating system, and with the TCP packetisation layer,
all comparing notes and sharing information.

It would be a nightmare to debug, and involves flows of information
about specific hosts in ways which I guess are not normally done
inside an operating system.

How is the system to respond to packet loss which is not caused by
MTU problems, but which is indistinguishable from loss caused by
MTU problems?  The estimate would have to be reduced - clobbering
all applications with shorter packets, just because a few packets
were lost, or their acknowledgement packets were lost.

RFC 4821 sounds too fluffy - and all too difficult - with only
marginal benefits accruing if all the apps do it well and the OS
integrates and shares the information well.

What is to stop a bad application from failing in a way it thinks is
an MTU problem, dragging all the other apps down to using a short
packet length?
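To be concrete about what I mean: the probing RFC 4821 describes
amounts to a search over candidate packet sizes - a toy sketch,
with probe_fits standing in for real packetisation-layer probing:

```python
def search_pmtu(probe_fits, low=1024, high=9000):
    """Binary-search a path-MTU estimate, RFC 4821 style.

    probe_fits(size) -> True if a probe packet of `size` bytes was
    acknowledged (a stand-in for sending a real probe segment)."""
    while low < high:
        mid = (low + high + 1) // 2
        if probe_fits(mid):
            low = mid           # probe got through: raise the floor
        else:
            high = mid - 1      # probe lost: assume it was too big
    return low

# Simulated path with a 1400-byte limit:
print(search_pmtu(lambda size: size <= 1400))   # 1400
```

The `else` branch is exactly the trouble: any lost probe, whatever
its real cause, drags the estimate down for everything sharing it.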


> The first mistake was to invent the DF bit in the first place.

I guess you mean that all packets should always have been
non-fragmentable and that something like RFC 1191 should always have
been in existence.

> The second mistake is to suggest that the DF bit be set for ALL
> packets to do PMTUD in RFC 1191.

I don't understand your objection.

Removing fragmentation from the network is a really good aspect of
IPv6, I think.  Ideally, I think, all packets should be sent DF=1
and all applications should be ready to cope if their packets
(longer than some basic assumed minimum length) generate ICMP PTB
messages.  The trick is to be able to securely receive ICMP PTB
messages.  Without a nonce, this gets messy, and relies on the ICMP
code keeping records of recently sent packets, or at least the start
of them.  How recent?  How many bytes to store?  There seem to be
no hard answers.
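For what it's worth, on Linux the per-socket DF behaviour is
controlled with the IP_MTU_DISCOVER socket option - a minimal,
Linux-specific sketch, with the constant values copied from
linux/in.h since Python's socket module may not export the names:

```python
import socket

# Values from linux/in.h (not always exported by Python's socket module):
IP_MTU_DISCOVER = 10
IP_PMTUDISC_DO = 2        # always set DF=1; never fragment locally

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, IP_PMTUDISC_DO)

# Outgoing datagrams now carry DF=1.  Once an ICMP PTB has updated
# the kernel's cached PMTU, an oversized send() fails with EMSGSIZE
# instead of being fragmented.
print(s.getsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER))   # 2
s.close()
```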

It is difficult, but what else is there, since the PMTU could
change at any time?  We can't be sending big probe packets every 30
seconds in case the MTU drops.  We can't assume that lack of
acknowledgement is an MTU problem - it could be a temporary outage,
or the destination host going dead for a moment.  So it would be
silly to adjust MTUs downwards and try sending smaller and smaller
packets, until, when an ACK finally arrives, we decide the problem
must have been the PMTU dropping!

The only reasonable solution seems to be to send all packets DF=1
and expect all routers to report PMTU troubles with a PTB message.
Networks which block PTB packets are doing themselves and anyone who
connects to them a grave disservice.


> I'm not sure if implicitly making IPv6 packets unfragmentable was
> mistake, but relying on ICMP messages was.

Do you suggest some other kind of message, or do you think PMTUD
should be done on the basis of positive acknowledgements alone, with
silent discarding of a too-big packet at whichever router can't
handle it?


> In any event, all of these mistakes have been made and we aren't
> even close to cleaning up the mess with stuff like RFC 4821,
> so now we have to live with that, which means, among other things,
> taking very good care of ICMP too big messages.

Google:   No results found for "RFC 4821 deployment".

Blocking ICMP PTB messages is a dumb thing to do.  I understand
people wanting to block ping or ping response packets - but I think
that PTBs are the best and probably the only practical method for
doing PMTUD.

 - Robin


--
to unsubscribe send a message to rrg-request@psg.com with the
word 'unsubscribe' in a single line as the message text body.
archive: <http://psg.com/lists/rrg/> & ftp://psg.com/pub/lists/rrg