ShuffleWatcher: Shuffle-aware Scheduling in Multi-tenant MapReduce Clusters

MapReduce clusters are usually multi-tenant (i.e., shared among multiple users and jobs) for improving cost and utilization. The performance of jobs in a multitenant MapReduce cluster is greatly impacted by the all-Map-to-all-Reduce communication, or Shuffle, which saturates the cluster's hard-to-scale network bisection bandwidth. Previous schedulers optimize Map input locality but do not consider the Shuffle, which is often the dominant source of traffic in MapReduce clusters. We propose ShuffleWatcher, a new multitenant MapReduce scheduler that shapes and reduces Shuffle traffic to improve cluster performance (throughput and job turn-around times), while operating within specified fairness constraints. ShuffleWatcher employs three key techniques. First, it curbs intra-job Map-Shuffle concurrency to shape Shuffle traffic by delaying or elongating a job's Shuffle based on the network load. Second, it exploits the reduced intra-job concurrency and the flexibility engendered by the replication of Map input data for fault tolerance to preferentially assign a job's Map tasks to localize the Map output to as few nodes as possible. Third, it exploits localized Map output and delayed Shuffle to reduce the Shuffle traffic by preferentially assigning a job's Reduce tasks to the nodes containing its Map output. ShuffleWatcher leverages opportunities that are unique to multi-tenancy, such overlapping Map with Shuffle across jobs rather than within a job, and trading-off intra-job concurrency for reduced Shuffle traffic. On a 100-node Amazon EC2 cluster running Hadoop, ShuffleWatcher improves cluster throughput by 39-46% and job turn-around times by 27-32% over three state-of-the-art schedulers.

[1]  Ling Liu,et al.  Purlieus: Locality-aware resource allocation for MapReduce in a cloud , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[2]  Lei Shi,et al.  Dcell: a scalable and fault-tolerant network structure for data centers , 2008, SIGCOMM '08.

[3]  Geoffrey C. Fox,et al.  Twister: a runtime for iterative MapReduce , 2010, HPDC '10.

[4]  Michael J. Franklin,et al.  Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing , 2012, NSDI.

[5]  László Gyarmati,et al.  Scafida: a scale-free network inspired data center architecture , 2010, CCRV.

[6]  Douglas Thain,et al.  Distributed computing in practice: the Condor experience , 2005, Concurr. Pract. Exp..

[7]  Archana Ganapathi,et al.  The Case for Evaluating MapReduce Performance Using Workload Suites , 2011, 2011 IEEE 19th Annual International Symposium on Modelling, Analysis, and Simulation of Computer and Telecommunication Systems.

[8]  Joseph M. Hellerstein,et al.  MapReduce Online , 2010, NSDI.

[9]  Michael D. Ernst,et al.  The HaLoop approach to large-scale iterative data analysis , 2012, The VLDB Journal.

[10]  Antony Rowstron,et al.  Symbiotic routing in future data centers , 2010, SIGCOMM 2010.

[11]  Haitao Wu,et al.  BCube: a high performance, server-centric network architecture for modular data centers , 2009, SIGCOMM '09.

[12]  Seyong Lee,et al.  MapReduce with communication overlap (MaRCO) , 2013, J. Parallel Distributed Comput..

[13]  Scott Rixner,et al.  Medusa: Managing Concurrency and Communication in Embedded Systems , 2014, USENIX Annual Technical Conference.

[14]  Michael I. Jordan,et al.  Managing data transfers in computer clusters with orchestra , 2011, SIGCOMM.

[15]  Indranil Gupta,et al.  Breaking the MapReduce stage barrier , 2010, 2010 IEEE International Conference on Cluster Computing.

[16]  Amin Vahdat,et al.  PortLand: a scalable fault-tolerant layer 2 data center network fabric , 2009, SIGCOMM '09.

[17]  Scott Shenker,et al.  Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling , 2010, EuroSys '10.

[18]  Albert G. Greenberg,et al.  Sharing the Data Center Network , 2011, NSDI.

[19]  Albert G. Greenberg,et al.  Reining in the Outliers in Map-Reduce Clusters using Mantri , 2010, OSDI.

[20]  Miron Livny,et al.  Distributed computing in practice: the Condor experience: Research Articles , 2005 .

[21]  Wolfgang Schröder-Preikschat,et al.  Static Analysis of Variability in System Software: The 90, 000 #ifdefs Issue , 2014, USENIX Annual Technical Conference.

[22]  Ankit Singla,et al.  Jellyfish: Networking Data Centers Randomly , 2011, NSDI.

[23]  Amin Vahdat,et al.  A scalable, commodity data center network architecture , 2008, SIGCOMM '08.

[24]  Antony I. T. Rowstron,et al.  Camdoop: Exploiting In-network Aggregation for Big Data Applications , 2012, NSDI.

[25]  Garrick Staples,et al.  TORQUE resource manager , 2006, SC.

[26]  Jeffrey C. Mogul,et al.  Taming the Flying Cable Monster: A Topology Design and Optimization Framework for Data-Center Networks , 2011, USENIX ATC.

[27]  Benjamin Hindman,et al.  Dominant Resource Fairness: Fair Allocation of Multiple Resource Types , 2011, NSDI.

[28]  Amin Vahdat,et al.  Helios: a hybrid electrical/optical switch architecture for modular data centers , 2010, SIGCOMM '10.

[29]  Andrew V. Goldberg,et al.  Quincy: fair scheduling for distributed computing clusters , 2009, SOSP '09.

[30]  Raghu Ramakrishnan,et al.  Sailfish: a framework for large scale data processing , 2012, SoCC '12.

[31]  Min Li,et al.  CAM: a topology aware minimum cost flow based resource manager for MapReduce applications in the cloud , 2012, HPDC '12.

[32]  T. N. Vijaykumar,et al.  Tarazu: optimizing MapReduce on heterogeneous clusters , 2012, ASPLOS XVII.

[33]  Randy H. Katz,et al.  Improving MapReduce Performance in Heterogeneous Environments , 2008, OSDI.

[34]  Yuan Yu,et al.  Dryad: distributed data-parallel programs from sequential building blocks , 2007, EuroSys '07.

[35]  Angela Demke Brown,et al.  Reliable Writeback for Client-side Flash Caches , 2014, USENIX Annual Technical Conference.

[36]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[37]  Randy H. Katz,et al.  Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center , 2011, NSDI.

[38]  Amin Vahdat,et al.  Scale-Out Networking in the Data Center , 2010, IEEE Micro.