The Impact of Capacity Scheduler Configuration Settings on MapReduce Jobs

MapReduce is a parallel programming paradigm used for processing huge datasets on certain classes of distributable problems using a cluster. Budgetary constraints and the need for better usage of resources in a MapReduce cluster often influence an organization to rent or share hardware resources for their main data processing and analysis tasks. Thus, there may be many competing jobs from different clients performing simultaneous requests to the MapReduce framework on a particular cluster. Schedulers like Fair Share and Capacity have been specially designed for such purposes. Administrators and users run into performance problems, however, because they do not know the exact meaning of different task scheduler settings and what impact they can have with respect to the application execution time and resource allocation policy decisions. Existing work shows that the performance of MapReduce jobs depends on the cluster configuration, input data type and job configuration settings. However, that work fails to take into account the task scheduler settings. We show, through experimental evaluation, that task scheduler configuration parameters make a significant difference to the performance of the cluster and it is important to understand the influence of such parameters. Based on our findings, we also identified some of the open issues in the existing area of research.

[1]  Herodotos Herodotou,et al.  Profiling, what-if analysis, and cost-based optimization of MapReduce programs , 2011, Proc. VLDB Endow..

[2]  Beng Chin Ooi,et al.  The performance of MapReduce , 2010, Proc. VLDB Endow..

[3]  Guanying Wang,et al.  Using realistic simulation for performance analysis of mapreduce setups , 2009, LSAP '09.

[4]  Roy H. Campbell,et al.  Play It Again, SimMR! , 2011, 2011 IEEE International Conference on Cluster Computing.

[5]  Maozhen Li,et al.  MRSim: A discrete event based MapReduce simulator , 2010, 2010 Seventh International Conference on Fuzzy Systems and Knowledge Discovery.

[6]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[7]  Hairong Kuang,et al.  The Hadoop Distributed File System , 2010, 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST).

[8]  Shivnath Babu,et al.  Towards automatic optimization of MapReduce programs , 2010, SoCC '10.

[9]  Guanying Wang,et al.  A simulation approach to evaluating design decisions in MapReduce setups , 2009, 2009 IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems.

[10]  Archana Ganapathi,et al.  The Case for Evaluating MapReduce Performance Using Workload Suites , 2011, 2011 IEEE 19th Annual International Symposium on Modelling, Analysis, and Simulation of Computer and Telecommunication Systems.

[11]  Liang Dong,et al.  Starfish: A Self-tuning System for Big Data Analytics , 2011, CIDR.