MR-COF: A Genetic MapReduce Configuration Optimization Framework

Hadoop/MapReduce has emerged as a de facto programming framework for exploiting cloud-computing resources. Hadoop exposes many configuration parameters, some of which are crucial to the performance of MapReduce jobs. In practice, these parameters are usually left at their defaults or set to inappropriate values, which severely limits system performance (e.g., it lengthens execution time). It is therefore essential, but also challenging, to tune these parameters automatically so as to optimize MapReduce job performance. In this paper, we propose an automatic MapReduce configuration optimization framework named MR-COF. By monitoring and analyzing runtime behavior, the framework builds a cost-based model that predicts MapReduce job performance. In addition, we design a genetic search algorithm that iteratively tunes the parameters to find the best configuration. Testbed-based experimental results show that MR-COF improves average MapReduce job performance by 35% compared to the default configuration.
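
The combination of a performance prediction model and a genetic search can be illustrated with a minimal sketch. This is not the MR-COF implementation: the abstract does not specify the cost model, the tuned parameters, or the genetic operators, so everything below is an assumption made for illustration. The three keys are Hadoop 1.x parameter names, but their ranges, the baseline values, and the `predicted_cost` function (a toy stand-in for a cost-based model) are hypothetical.

```python
import random

# Hypothetical search space: parameter name -> (min, max) integer range.
# The ranges are illustrative and not taken from the paper.
SPACE = {
    "io.sort.mb": (50, 500),
    "io.sort.factor": (5, 100),
    "mapred.reduce.tasks": (1, 64),
}

# Baseline configuration used as the starting point (illustrative values).
BASELINE = {"io.sort.mb": 100, "io.sort.factor": 10, "mapred.reduce.tasks": 1}


def predicted_cost(cfg):
    """Toy stand-in for a cost-based prediction model (lower is better).

    A real model would estimate job execution time from profiled runtime
    behavior; here the cost is just the distance to an arbitrary optimum.
    """
    target = {"io.sort.mb": 300, "io.sort.factor": 50, "mapred.reduce.tasks": 24}
    return sum(((cfg[k] - target[k]) / (hi - lo)) ** 2
               for k, (lo, hi) in SPACE.items())


def random_config(rng):
    """Draw one configuration uniformly from the search space."""
    return {k: rng.randint(lo, hi) for k, (lo, hi) in SPACE.items()}


def crossover(a, b, rng):
    """Uniform crossover: each parameter comes from one of the two parents."""
    return {k: (a[k] if rng.random() < 0.5 else b[k]) for k in SPACE}


def mutate(cfg, rng, rate=0.2):
    """Resample each parameter within its range with probability `rate`."""
    out = dict(cfg)
    for k, (lo, hi) in SPACE.items():
        if rng.random() < rate:
            out[k] = rng.randint(lo, hi)
    return out


def genetic_search(baseline, generations=30, pop_size=20, elite=4, seed=0):
    """Iteratively evolve configurations, keeping the `elite` best each round."""
    rng = random.Random(seed)
    population = [dict(baseline)]
    population += [random_config(rng) for _ in range(pop_size - 1)]
    for _ in range(generations):
        population.sort(key=predicted_cost)
        parents = population[:elite]  # elitism: the best survive unchanged
        children = []
        while len(children) < pop_size - elite:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        population = parents + children
    return min(population, key=predicted_cost)


best = genetic_search(BASELINE)
print("best configuration:", best)
print("predicted cost:", predicted_cost(best), "vs baseline:", predicted_cost(BASELINE))
```

Because the baseline is seeded into the initial population and the best candidates are carried over unchanged, the returned configuration is never predicted to be worse than the baseline. In a real tuner the toy cost function would be replaced by the prediction model, so each candidate is scored without running the job.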
