Managing data transfers in computer clusters with Orchestra

Cluster computing applications like MapReduce and Dryad transfer massive amounts of data between their computation stages. These transfers can have a significant impact on job performance, accounting for more than 50% of job completion times. Despite this impact, there has been relatively little work on optimizing the performance of these data transfers, with networking researchers traditionally focusing on per-flow traffic management. We address this limitation by proposing a global management architecture and a set of algorithms that (1) improve the transfer times of common communication patterns, such as broadcast and shuffle, and (2) allow scheduling policies at the transfer level, such as prioritizing a transfer over other transfers. Using a prototype implementation, we show that our solution improves broadcast completion times by up to 4.5X compared to the status quo in Hadoop. We also show that transfer-level scheduling can reduce the completion time of high-priority transfers by 1.7X.
