Scheduling Dependent Coflows with Guaranteed Job Completion Time

Today's data center jobs typically follow a coflow model: each coflow consists of multiple concurrent data flows, and each job comprises multiple coflows. A job benefits only when all flows in all of its coflows complete. To guarantee job completion time, job deadlines and coflow dependencies must be considered jointly. However, existing solutions focus mainly on coflow scheduling and are insufficient to guarantee the completion time of jobs with multiple dependent coflows. In this paper, we study the dependent coflow scheduling problem under job deadline constraints. Specifically, we formulate a deadline- and dependency-aware optimization problem, and accordingly propose a two-level scheduling method to solve it. The first level schedules at the job level with a most-bottleneck-first heuristic algorithm. The second level performs intra-job scheduling, seamlessly combining prioritized scheduling with weighted fair scheduling to account for different coflow dependencies. We conduct comprehensive simulations to evaluate the performance of our two-level scheduling method. Extensive results show that our scheduling method can reduce job completion time by up to 18% and accommodate 21% more jobs with deadlines guaranteed, compared to the conventional shortest-job-first method.
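To make the job-level heuristic concrete, the following is a minimal sketch of one plausible reading of "most-bottleneck-first" ordering: aggregate each job's coflow demand per fabric port, take the heaviest port load as the job's bottleneck, and schedule jobs with the largest bottleneck first. All names (`Coflow`, `Job`, `bottleneck`, the per-port demand model) are illustrative assumptions, not the paper's actual formulation.

```python
from dataclasses import dataclass


@dataclass
class Coflow:
    # demand[port] = bytes this coflow must move through that fabric port
    # (hypothetical demand model for illustration)
    demand: dict


@dataclass
class Job:
    name: str
    coflows: list


def bottleneck(job):
    """Aggregate per-port demand over all of a job's coflows and
    return the load on the most heavily loaded port."""
    load = {}
    for cf in job.coflows:
        for port, d in cf.demand.items():
            load[port] = load.get(port, 0) + d
    return max(load.values()) if load else 0


def most_bottleneck_first(jobs):
    """Order jobs so the one with the largest bottleneck load comes first."""
    return sorted(jobs, key=bottleneck, reverse=True)


# Usage: job B's single coflow loads port p1 more heavily than job A's
# coflow loads any port, so B is scheduled ahead of A.
jobs = [
    Job("A", [Coflow({"p1": 5, "p2": 3})]),
    Job("B", [Coflow({"p1": 9})]),
]
order = most_bottleneck_first(jobs)
print([j.name for j in order])  # → ['B', 'A']
```

The intuition behind prioritizing the most bottlenecked job is that its bottleneck port lower-bounds its completion time, so delaying it risks deadline misses that lightly loaded jobs can absorb.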
