Predator — An experience guided configuration optimizer for Hadoop MapReduce

MapReduce is a distributed programming framework that provides an effective solution to the data processing challenge. Hadoop, an open-source implementation of MapReduce, is widely used in practice. The performance of Hadoop MapReduce depends heavily on its configuration settings, so tuning these parameters can be an effective way to improve performance. However, finding the optimal configuration is difficult because MapReduce jobs are time-consuming to run and the configuration optimization problem is high-dimensional and nonlinear. In this paper, we introduce Predator, an experience-guided configuration optimizer that does not treat the optimization as a pure black-box problem but instead uses experience learned from Hadoop MapReduce configuration practice to assist the optimization process. The optimizer uses the job execution time estimated by a practical MapReduce cost model as its objective function, and it classifies Hadoop MapReduce parameters into groups according to their tunable levels in order to shrink the search space. Furthermore, its optimization algorithm uses subspace division to avoid becoming trapped in local optima, and it reduces search time by cutting the cost of visiting unpromising points in the search space. Experiments on Hadoop clusters demonstrate the effectiveness and efficiency of the optimizer.
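To make the idea of subspace division concrete, the following is a minimal Python sketch, not the algorithm from the paper. It assumes a two-parameter search space and a synthetic `toy_cost` function standing in for a cost model's estimated job time. The parameter bounds, the helper names (`divide`, `sample`, `search`), and the settings `parts`, `probes`, `keep`, and `refine` are all hypothetical. The sketch probes every subspace a few times, then spends the remaining evaluations only in the most promising subspaces.

```python
import itertools
import random

# Hypothetical search space: (low, high) integer bounds for a small group of
# parameters assumed worth tuning. The bounds are illustrative only.
SPACE = {
    "io.sort.mb": (50, 400),
    "mapred.reduce.tasks": (1, 64),
}


def toy_cost(cfg):
    """Synthetic stand-in for an estimated job time; NOT a real Hadoop model."""
    sort_mb = cfg["io.sort.mb"]
    reducers = cfg["mapred.reduce.tasks"]
    return (sort_mb - 220) ** 2 / 100.0 + (reducers - 24) ** 2 + 60.0


def divide(space, parts):
    """Cut every parameter range into `parts` intervals; return all boxes."""
    per_param = []
    for name, (lo, hi) in space.items():
        step = (hi - lo) / parts
        per_param.append(
            [(name, lo + i * step, lo + (i + 1) * step) for i in range(parts)]
        )
    return [
        {name: (a, b) for name, a, b in combo}
        for combo in itertools.product(*per_param)
    ]


def sample(box, rng):
    """Draw one integer-valued configuration uniformly from a box."""
    return {name: round(rng.uniform(a, b)) for name, (a, b) in box.items()}


def search(space, cost, parts=4, probes=3, keep=2, refine=50, seed=0):
    """Probe every subspace briefly, then refine only the best `keep` boxes."""
    rng = random.Random(seed)
    scored = []
    for box in divide(space, parts):
        best = min((sample(box, rng) for _ in range(probes)), key=cost)
        scored.append((cost(best), box, best))
    scored.sort(key=lambda item: item[0])  # most promising subspaces first

    best_cost, _, best_cfg = scored[0]
    for _, box, _ in scored[:keep]:
        for _ in range(refine):
            cfg = sample(box, rng)
            c = cost(cfg)
            if c < best_cost:
                best_cost, best_cfg = c, cfg
    return best_cfg, best_cost


best_cfg, best_cost = search(SPACE, toy_cost)
print(best_cfg, best_cost)
```

Probing every subspace before refining keeps the search from committing to the first good region it finds, while skipping refinement in poorly scoring subspaces limits the number of cost evaluations spent on unpromising points.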
