Scalable Performance Tuning of Hadoop MapReduce: A Noisy Gradient Approach

Hadoop MapReduce is a popular framework for distributed storage and processing of large datasets and is widely used for big data analytics. Its many configuration parameters strongly influence performance, that is, the execution time of a given big data processing job. The default values of these parameters do not yield good performance, so tuning them is important. Tuning is inherently difficult for two reasons: first, the parameter search space is large, and second, the parameters interact with one another. Hence, there is a need for a dimensionality-free method that tunes the configuration parameters automatically while accounting for cross-parameter dependencies. In this paper, we propose a novel Hadoop parameter tuning methodology based on a noisy gradient algorithm known as simultaneous perturbation stochastic approximation (SPSA). The SPSA algorithm tunes the selected parameters by directly observing the performance of the Hadoop MapReduce system. The approach is independent of the number of parameters and requires only 2 observations per iteration while tuning. We demonstrate the effectiveness of our methodology on popular Hadoop benchmarks, namely Grep, Bigram, Inverted Index, Word Co-occurrence and Terasort. When tested on a 25-node Hadoop cluster, our method shows a 45-66% decrease in the execution time of Hadoop jobs on average, compared to prior methods. Further, our experiments indicate that the parameters tuned by our method are resilient to changes in the number of cluster nodes, which makes our method suitable for optimizing Hadoop when it is provided as a service on the cloud.
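To make the "2 observations per iteration" property concrete, the following is a minimal sketch of a generic SPSA loop, not the authors' implementation. Several things here are assumptions made for illustration: the names `spsa_minimize` and `make_toy_job` are hypothetical, the parameters are taken to be normalized to [0, 1], the gain constants and decay exponents (0.602 and 0.101) are commonly used SPSA defaults rather than values from this paper, and a noisy quadratic stands in for the measured execution time of a Hadoop job.

```python
import random


def spsa_minimize(measure, theta0, iterations=200, a=0.1, c=0.1, seed=0):
    """Minimize a noisy cost with SPSA (illustrative sketch).

    measure(theta) returns one noisy observation of the cost; in a real
    tuning setup this would be the execution time of a job run with the
    configuration encoded by theta (assumed normalized to [0, 1]).
    """
    rng = random.Random(seed)
    theta = list(theta0)
    for k in range(1, iterations + 1):
        a_k = a / k ** 0.602  # step size (assumed decay exponent)
        c_k = c / k ** 0.101  # perturbation size (assumed decay exponent)
        # Perturb every coordinate at once with independent +/-1 draws.
        delta = [rng.choice((-1.0, 1.0)) for _ in theta]
        # Exactly two observations, regardless of the number of parameters.
        y_plus = measure([t + c_k * d for t, d in zip(theta, delta)])
        y_minus = measure([t - c_k * d for t, d in zip(theta, delta)])
        diff = (y_plus - y_minus) / (2.0 * c_k)
        # Per-coordinate gradient estimate is diff / delta_i; step against
        # it and clip back into the assumed [0, 1] range.
        theta = [min(1.0, max(0.0, t - a_k * diff / d))
                 for t, d in zip(theta, delta)]
    return theta


def make_toy_job(target, noise_std=0.01, seed=1):
    """Hypothetical stand-in for a job: noisy squared distance to target."""
    rng = random.Random(seed)

    def measure(theta):
        cost = sum((t - s) ** 2 for t, s in zip(theta, target))
        return cost + rng.gauss(0.0, noise_std)

    return measure


if __name__ == "__main__":
    job = make_toy_job([0.2, 0.8, 0.5, 0.3, 0.7])
    print(spsa_minimize(job, [0.5] * 5))
```

A coordinate-wise finite-difference scheme would need two observations per parameter, so its cost per iteration grows with the number of parameters. Here the single random direction `delta` is shared by all coordinates, which is why the cost stays at two observations per iteration.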
