Paper information - Coflow: A Networking Abstraction for Distributed Data-Parallel Applications

Over the past decade, the confluence of an unprecedented growth in data volumes and the rapid rise of cloud computing has fundamentally transformed systems software and corresponding infrastructure. To deal with massive datasets, more and more applications today are scaling out to large datacenters. These distributed data-parallel applications run on tens to thousands of machines in parallel to exploit I/O parallelism, and they enable a wide variety of use cases, including interactive analysis, SQL queries, machine learning, and graph processing. Communication between the distributed computation tasks of these applications often results in massive data transfers over the network. Consequently, concentrated efforts in both industry and academia have gone into building high-capacity, low-latency datacenter networks at scale. At the same time, researchers and practitioners have proposed a wide variety of solutions to minimize flow completion times or to ensure per-flow fairness based on the point-to-point flow abstraction that forms the basis of the TCP/IP stack. We observe that despite rapid innovations in both applications and infrastructure, application- and network-level goals are moving further apart. Data-parallel applications care about all their flows, but today’s networks treat each point-to-point flow independently. This fundamental mismatch has resulted in complex point solutions for application developers, a myriad of configuration options for end users, and an overall loss of performance. The key contribution of this dissertation is bridging this gap between application-level performance and network-level optimizations through the coflow abstraction. Each multipoint-to-multipoint coflow represents a collection of flows with a common application-level performance objective, enabling application-aware decision making in the network.
We describe complete solutions, including architectures, algorithms, and implementations, that apply coflows to multiple scenarios using central coordination. Through large-scale cloud deployments and trace-driven simulations, we demonstrate that simply knowing how flows relate to each other is enough to schedule the network better, meet more deadlines, and provide stronger performance isolation than is otherwise possible with today’s application-agnostic solutions. In addition to performance improvements, coflows allow us to consolidate communication optimizations across multiple applications, simplifying software development and relieving end users from parameter tuning. On the theoretical front, we discover and characterize, for the first time, the family of problems known as concurrent open shop scheduling with coupled resources. Because any flow is also a coflow with just one flow, the coflows and coflow-based solutions presented in this dissertation generalize a large body of work in both the networking and scheduling literature.
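
To make the abstraction concrete, the following is a minimal Python sketch of a coflow as a named collection of point-to-point flows. It is an illustration only, not the dissertation's implementation: the class names `Flow` and `Coflow`, the helper `from_single_flow`, the field names, and the example shuffle are all assumptions introduced here.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Flow:
    """A point-to-point transfer between two machines (hypothetical type)."""
    src: str
    dst: str
    size_bytes: int


@dataclass
class Coflow:
    """A collection of flows sharing one application-level objective (hypothetical type)."""
    name: str
    flows: list[Flow] = field(default_factory=list)

    def senders(self) -> set[str]:
        # Distinct source machines: the "multipoint" sending side.
        return {f.src for f in self.flows}

    def receivers(self) -> set[str]:
        # Distinct destination machines: the "multipoint" receiving side.
        return {f.dst for f in self.flows}

    def total_bytes(self) -> int:
        # Aggregate volume the application needs moved as one unit.
        return sum(f.size_bytes for f in self.flows)


def from_single_flow(flow: Flow) -> Coflow:
    """Wrap one flow as a coflow: any flow is also a coflow with just one flow."""
    return Coflow(name=f"{flow.src}->{flow.dst}", flows=[flow])


# Hypothetical example: two senders each transfer 100 bytes to two receivers.
shuffle = Coflow(
    name="shuffle-0",
    flows=[Flow(m, r, 100) for m in ("m1", "m2") for r in ("r1", "r2")],
)
single = from_single_flow(Flow("a", "b", 5))
```

Grouping the four transfers under one `Coflow` is what lets a scheduler see how the flows relate to each other, while `from_single_flow` shows why per-flow approaches can be treated as a special case.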

Mosharaf Chowdhury
