Network-Aware Scheduling for Data-Parallel Jobs: Plan When You Can

To reduce the impact of network congestion on big data jobs, cluster management frameworks use various heuristics to schedule compute tasks and/or network flows. Most of these schedulers consider the job input data fixed and greedily schedule the tasks and flows that are ready to run. However, a large fraction of production jobs are recurring with predictable characteristics, which allows us to plan ahead for them. Coordinating the placement of data and tasks of these jobs allows for significantly improving their network locality and freeing up bandwidth, which can be used by other jobs running on the cluster. With this intuition, we develop Corral, a scheduling framework that uses characteristics of future workloads to determine an offline schedule which (i) jointly places data and compute to achieve better data locality, and (ii) isolates jobs both spatially (by scheduling them in different parts of the cluster) and temporally, improving their performance. We implement Corral on Apache Yarn, and evaluate it on a 210 machine cluster using production workloads. Compared to Yarn's capacity scheduler, Corral reduces the makespan of these workloads up to 33% and the median completion time up to 56%, with 20-90% reduction in data transferred across racks.

[1]  Srikanth Kandula,et al.  Jockey: guaranteed job latency in data parallel clusters , 2012, EuroSys '12.

[2]  Diksha Verma,et al.  Quincy: Fair Scheduling for Distributed Computing Clusters , 2014 .

[3]  Sriram Rao,et al.  A The Quantcast File System , 2013, Proc. VLDB Endow..

[4]  RENAUD LEPÈRE,et al.  Approximation Algorithms for Scheduling Malleable Tasks Under Precedence Constraints , 2001, Int. J. Found. Comput. Sci..

[5]  Nicolas Bruno,et al.  SCOPE: parallel databases meet MapReduce , 2012, The VLDB Journal.

[6]  Ronald L. Graham,et al.  Bounds on Multiprocessing Timing Anomalies , 1969, SIAM Journal of Applied Mathematics.

[7]  Antony I. T. Rowstron,et al.  Bridging the tenant-provider gap in cloud services , 2012, SoCC '12.

[8]  Scott Shenker,et al.  Spark: Cluster Computing with Working Sets , 2010, HotCloud.

[9]  Srikanth Kandula,et al.  Leveraging endpoint flexibility in data-intensive clusters , 2013, SIGCOMM.

[10]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[11]  Magdalena Balazinska,et al.  SkewTune: mitigating skew in mapreduce applications , 2012, SIGMOD Conference.

[12]  Srikanth Kandula,et al.  PACMan: Coordinated Memory Caching for Parallel Jobs , 2012, NSDI.

[13]  Christo Wilson,et al.  Better never than late , 2011, SIGCOMM 2011.

[14]  Antony I. T. Rowstron,et al.  Better never than late: meeting deadlines in datacenter networks , 2011, SIGCOMM.

[15]  Ling Liu,et al.  Purlieus: Locality-aware resource allocation for MapReduce in a cloud , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[16]  Srikanth Kandula,et al.  Reoptimizing Data Parallel Computing , 2012, NSDI.

[17]  Antony I. T. Rowstron,et al.  Decentralized task-aware scheduling for data center networks , 2014, SIGCOMM.

[18]  Philip S. Yu,et al.  Approximate algorithms scheduling parallelizable tasks , 1992, SPAA '92.

[19]  Scott Shenker,et al.  Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling , 2010, EuroSys '10.

[20]  Min Li,et al.  CAM: a topology aware minimum cost flow based resource manager for MapReduce applications in the cloud , 2012, HPDC '12.

[21]  Andrew V. Goldberg,et al.  Quincy: fair scheduling for distributed computing clusters , 2009, SOSP '09.

[22]  Ion Stoica,et al.  Efficient coflow scheduling with Varys , 2014, SIGCOMM.

[23]  Herodotos Herodotou,et al.  No one (cluster) size fits all: automatic cluster sizing for data-intensive analytics , 2011, SoCC.

[24]  Brighten Godfrey,et al.  Finishing flows quickly with preemptive scheduling , 2012, CCRV.

[25]  Albert G. Greenberg,et al.  Scarlett: coping with skewed content popularity in mapreduce clusters , 2011, EuroSys '11.

[26]  Pete Wyckoff,et al.  Hive - A Warehousing Solution Over a Map-Reduce Framework , 2009, Proc. VLDB Endow..

[27]  Prithviraj Banerjee,et al.  An Approximate Algorithm for the Partitionable Independent Task Scheduling Problem , 1990, ICPP.

[28]  Y.-K. Kwok,et al.  Static scheduling algorithms for allocating directed task graphs to multiprocessors , 1999, CSUR.

[29]  Jingren Zhou,et al.  SCOPE: easy and efficient parallel processing of massive data sets , 2008, Proc. VLDB Endow..

[30]  Anand Raghunathan,et al.  ShuffleWatcher: Shuffle-aware Scheduling in Multi-tenant MapReduce Clusters , 2014, USENIX Annual Technical Conference.

[31]  Guanying Wang,et al.  A simulation approach to evaluating design decisions in MapReduce setups , 2009, 2009 IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems.

[32]  Archana Ganapathi,et al.  The Case for Evaluating MapReduce Performance Using Workload Suites , 2011, 2011 IEEE 19th Annual International Symposium on Modelling, Analysis, and Simulation of Computer and Telecommunication Systems.

[33]  Yuanyuan Tian,et al.  CoHadoop: Flexible Data Placement and Its Exploitation in Hadoop , 2011, Proc. VLDB Endow..

[34]  Liang Dong,et al.  Starfish: A Self-tuning System for Big Data Analytics , 2011, CIDR.

[35]  Michael I. Jordan,et al.  Managing data transfers in computer clusters with orchestra , 2011, SIGCOMM.

[36]  Joseph Y.-T. Leung,et al.  Complexity of Scheduling Parallel Task Systems , 1989, SIAM J. Discret. Math..

[37]  Srikanth Kandula,et al.  Multi-resource packing for cluster schedulers , 2014, SIGCOMM.

[38]  Khaled Elmeleegy,et al.  Piranha: Optimizing Short Jobs in Hadoop , 2013, Proc. VLDB Endow..