Hadoop Performance Self-Tuning Using a Fuzzy-Prediction Approach

The Apache Hadoop framework (currently known as YARN) is a widely used open-source implementation of MapReduce (MR). Manual tuning of Hadoop performance is hard and time-consuming, so several self-tuning approaches have been proposed. This paper proposes an approach that avoids three problems of previous self-tuning approaches based on performance models or resource usage, namely 1) the need for a time-consuming training phase, typically offline, 2) unsuitability for Hadoop environments with concurrently running MR jobs, and 3) the need to modify the Hadoop framework itself. The proposed approach uses a fuzzy-prediction controller for self-optimization of the number of concurrent MR jobs. The fuzzy-prediction controller learns from past and current resource usage of MR jobs and from the number of concurrent tasks. It both uses and constructs rules in real time to predict the resource usage and the number of concurrent tasks. It does not require offline training or any modification of either the MR jobs or the Hadoop framework. The predicted values are used to dynamically control the number of concurrent ApplicationMasters (AMs) (i.e., MR jobs in RUNNING state). Experimental evaluation of the proposed approach on a 7-node cluster (1 master node and 6 slave nodes) running 30-job sequences combining three different types of MR jobs (Terasort, Grep and Wordcount) showed up to 29% performance improvement over Hadoop default configurations. The new approach improves the aggregate performance of MR jobs by adjusting a single YARN parameter.
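
The control loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' controller: the membership functions, the three fixed rules, the trend-based predictor, and the names `tri`, `memberships`, `predict_next`, and `next_job_limit` are all assumptions made for illustration, and the sketch uses static rules rather than constructing them in real time. The concurrent-job limit is kept as an abstract integer because the abstract does not name the YARN parameter that is adjusted.

```python
def tri(x, a, b, c):
    """Triangular membership: 0 outside (a, c), rising to 1 at b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


def memberships(util):
    """Fuzzify a utilization in [0, 1] into low / medium / high degrees."""
    return {
        "low": tri(util, -0.5, 0.0, 0.5),
        "medium": tri(util, 0.0, 0.5, 1.0),
        "high": tri(util, 0.5, 1.0, 1.5),
    }


# Hypothetical rule consequents: desired change in the concurrent-job limit.
RULE_OUTPUT = {"low": 1.0, "medium": 0.0, "high": -1.0}


def predict_next(history):
    """Naive one-step prediction: last value plus the last observed trend."""
    last = history[-1]
    trend = last - history[-2] if len(history) >= 2 else 0.0
    return min(1.0, max(0.0, last + trend))  # clamp to [0, 1]


def next_job_limit(current_limit, util_history, min_limit=1, max_limit=16):
    """Return the new limit on concurrent MR jobs (assumed bounds 1..16)."""
    mu = memberships(predict_next(util_history))
    # Weighted-average defuzzification over the three rules.
    delta = sum(mu[k] * RULE_OUTPUT[k] for k in mu) / sum(mu.values())
    step = 1 if delta > 0.5 else -1 if delta < -0.5 else 0
    return min(max_limit, max(min_limit, current_limit + step))
```

With a falling utilization history such as `[0.2, 0.1]` the sketch raises the limit by one, and with a rising history such as `[0.8, 0.9]` it lowers the limit by one. A real controller would also need a way to read cluster utilization and to apply the new limit to the running cluster, both of which are left out here.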
