Paper information - ShuffleWatcher: Shuffle-aware Scheduling in Multi-tenant MapReduce Clusters


MapReduce clusters are usually multi-tenant (i.e., shared among multiple users and jobs) to improve cost and utilization. The performance of jobs in a multi-tenant MapReduce cluster is greatly impacted by the all-Map-to-all-Reduce communication, or Shuffle, which saturates the cluster's hard-to-scale network bisection bandwidth. Previous schedulers optimize Map input locality but do not consider the Shuffle, which is often the dominant source of traffic in MapReduce clusters. We propose ShuffleWatcher, a new multi-tenant MapReduce scheduler that shapes and reduces Shuffle traffic to improve cluster performance (throughput and job turn-around times), while operating within specified fairness constraints. ShuffleWatcher employs three key techniques. First, it curbs intra-job Map-Shuffle concurrency to shape Shuffle traffic by delaying or elongating a job's Shuffle based on the network load. Second, it exploits the reduced intra-job concurrency and the flexibility engendered by the replication of Map input data for fault tolerance to preferentially assign a job's Map tasks so that the Map output is localized to as few nodes as possible. Third, it exploits localized Map output and delayed Shuffle to reduce the Shuffle traffic by preferentially assigning a job's Reduce tasks to the nodes containing its Map output. ShuffleWatcher leverages opportunities that are unique to multi-tenancy, such as overlapping Map with Shuffle across jobs rather than within a job, and trading off intra-job concurrency for reduced Shuffle traffic. On a 100-node Amazon EC2 cluster running Hadoop, ShuffleWatcher improves cluster throughput by 39-46% and job turn-around times by 27-32% over three state-of-the-art schedulers.
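The first and third techniques can be illustrated with a minimal Python sketch. It is not the authors' implementation or any Hadoop API: the function names, the normalized `network_load` value, the 0.8 threshold, and the assumption that each Reduce task reads an equal share of every node's Map output are all hypothetical choices made for illustration.

```python
from typing import Dict, List


def should_delay_shuffle(network_load: float, threshold: float = 0.8) -> bool:
    """Postpone a job's Shuffle while the network is busy.

    `network_load` is a hypothetical utilization value in [0, 1];
    the 0.8 threshold is an assumption, not a value from the paper.
    """
    return network_load >= threshold


def pick_reduce_nodes(
    map_output_bytes: Dict[str, int],
    free_slots: Dict[str, int],
    num_reduces: int,
) -> List[str]:
    """Greedily place Reduce tasks on the nodes holding the most Map output."""
    # Rank nodes that have free Reduce slots by locally stored Map output.
    ranked = sorted(
        free_slots, key=lambda n: map_output_bytes.get(n, 0), reverse=True
    )
    placement: List[str] = []
    for node in ranked:
        take = min(free_slots[node], num_reduces - len(placement))
        placement.extend([node] * take)
        if len(placement) == num_reduces:
            break
    return placement


def remote_shuffle_bytes(
    map_output_bytes: Dict[str, int], placement: List[str]
) -> float:
    """Bytes that cross the network for a given Reduce placement.

    Assumes (for illustration only) that every Reduce task reads an
    equal share of each node's Map output.
    """
    if not placement:
        return 0.0
    share = 1 / len(placement)
    total = 0.0
    for reduce_node in placement:
        for map_node, size in map_output_bytes.items():
            if map_node != reduce_node:
                total += size * share
    return total


# Map output is concentrated on n1, as the second technique would encourage.
map_output = {"n1": 900, "n2": 100, "n3": 0}
slots = {"n1": 1, "n2": 1, "n3": 2}

local_placement = pick_reduce_nodes(map_output, slots, num_reduces=2)
naive_placement = ["n3", "n3"]  # placement that ignores Map output location
```

Under these assumptions, placing the two Reduce tasks on `n1` and `n2` moves less data across the network than placing both on `n3`, which holds no Map output. A real scheduler would also have to respect the fairness constraints mentioned above, which this sketch omits.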
