CloudTalk: Enabling Distributed Application Optimisations in Public Clouds

Clouds offer an opaque I/O API to their customers: details of the underlying resources (network topology, disk drives) and their current load are kept hidden. Tenants can profile the I/O performance of their VMs and optimise accordingly, but profiling itself increases the load on the shared infrastructure. Certain cloud providers try to discourage profiling by enforcing strict I/O isolation, at the cost of reduced utilisation in the average case. In this paper we challenge this status quo and propose CloudTalk, an API that allows tenants to communicate with the cloud provider and receive hints that they can use to optimise their workloads. We have built a distributed implementation of CloudTalk that scales to hundreds of machines and provides significant performance benefits in many cases. Further, we have modified Hadoop and HDFS to use CloudTalk when deciding which machines to use for task placement and replica selection. Our experiments in a local cluster and on Amazon EC2 show that CloudTalk improves performance by up to a factor of two across a wide range of scenarios.
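
To make the idea of provider hints concrete, the following is a minimal sketch of how a tenant might use such hints for replica selection. It is not the CloudTalk API: the names `HintService`, `Hint`, `rank`, and `choose_replica`, as well as the capacity and load figures, are hypothetical and chosen only for illustration. The sketch assumes the provider can estimate spare bandwidth per host and returns candidates ordered by that estimate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hint:
    host: str
    est_spare_mbps: float  # hypothetical provider-side estimate of spare bandwidth


class HintService:
    """Stand-in for the provider side of a hint API (illustrative only)."""

    def __init__(self, capacity_mbps, load_mbps):
        # Both dictionaries map host name -> Mbps; the values are made-up inputs.
        self.capacity_mbps = dict(capacity_mbps)
        self.load_mbps = dict(load_mbps)

    def rank(self, candidates):
        """Return one hint per candidate host, best estimated spare bandwidth first."""
        hints = [
            Hint(h, max(self.capacity_mbps[h] - self.load_mbps.get(h, 0.0), 0.0))
            for h in candidates
        ]
        return sorted(hints, key=lambda hint: hint.est_spare_mbps, reverse=True)


def choose_replica(service, replicas):
    """Tenant side: read from the replica the provider ranks highest."""
    return service.rank(replicas)[0].host


# Example: three hosts hold a replica of the same block.
service = HintService(
    capacity_mbps={"a": 1000.0, "b": 1000.0, "c": 1000.0},
    load_mbps={"a": 900.0, "b": 100.0, "c": 500.0},
)
best = choose_replica(service, ["a", "b", "c"])
```

In this sketch the tenant never sees the topology or the raw load figures; it only receives an ordering of the candidates it asked about, so it does not need to profile the hosts itself.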
