Simulation and Performance Evaluation of Hadoop Capacity Scheduler

MapReduce is a parallel programming paradigm for processing huge datasets on a cluster, applicable to certain classes of distributable problems. Budgetary constraints and the need for better resource utilization often lead organizations to rent or share the hardware of a MapReduce cluster for their main data processing and analysis tasks. As a result, many competing jobs from different clients may issue simultaneous requests to the MapReduce framework on a given cluster. Schedulers such as Fair Share and Capacity were designed specifically for this situation. Administrators and users nevertheless run into performance problems, because they do not know exactly what the various task scheduler settings mean or how those settings affect the allocation of resources across the organizations sharing a MapReduce cluster. In this work, Capacity Scheduler is integrated into the existing MRPERF simulator to predict the performance of MapReduce jobs in a shared cluster under different Capacity Scheduler settings. Several case studies of Capacity Scheduler's behaviour across different job patterns, among other factors, are also conducted using the integrated simulator.
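
To make concrete the kind of setting whose effect such a simulator is meant to predict, the following is a minimal sketch of how per-queue guaranteed and maximum capacities might divide a shared pool of task slots. It is an illustrative toy model, not the code of Capacity Scheduler or of the simulator described here. The function `allocate_slots`, the queue names, and the percentages are all hypothetical, and the rule used for handing out spare slots (give the next slot to the queue that is most under-served relative to its guarantee) is a simplifying assumption.

```python
def allocate_slots(total_slots, queues, demand):
    """Toy model of capacity-style sharing (hypothetical, for illustration only).

    queues: queue name -> (guaranteed capacity %, maximum capacity %)
    demand: queue name -> number of slots the queue's jobs could use
    Returns: queue name -> number of slots granted
    """
    guaranteed = {q: total_slots * cap // 100 for q, (cap, _) in queues.items()}
    ceiling = {q: total_slots * mx // 100 for q, (_, mx) in queues.items()}

    # Step 1: every queue first receives up to its guaranteed share.
    alloc = {q: min(demand.get(q, 0), guaranteed[q]) for q in queues}
    free = total_slots - sum(alloc.values())

    # Step 2: spare slots go, one at a time, to queues that still have
    # demand and are below their maximum capacity.
    while free > 0:
        hungry = [q for q in queues
                  if alloc[q] < min(demand.get(q, 0), ceiling[q])]
        if not hungry:
            break
        # Assumption: pick the queue with the lowest used/guaranteed ratio;
        # ties are broken by queue name.
        chosen = min(
            hungry,
            key=lambda n: (alloc[n] / guaranteed[n] if guaranteed[n] else float("inf"), n),
        )
        alloc[chosen] += 1
        free -= 1
    return alloc


# Three hypothetical organizations sharing a 100-slot cluster.
queues = {"orgA": (50, 100), "orgB": (30, 50), "orgC": (20, 20)}
demand = {"orgA": 80, "orgB": 10, "orgC": 40}
result = allocate_slots(100, queues, demand)
print(result)  # {'orgA': 70, 'orgB': 10, 'orgC': 20}
```

In this example orgB leaves 20 of its guaranteed slots idle. orgA can borrow them because its maximum capacity permits it, while orgC stays capped at 20 slots despite its unmet demand. This is the sort of interaction between settings that is hard to anticipate without simulation.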
