WaxElephant: A Realistic Hadoop Simulator for Parameters Tuning and Scalability Analysis

MapReduce is becoming the state-of-the-art computation paradigm for processing large-scale datasets on clusters of tens to thousands of nodes. Hadoop, an open-source implementation of the MapReduce framework, has gained much popularity due to its high scalability and performance. Two challenging issues for a large-scale Hadoop cluster are how to analyze its scalability and how to identify the optimal parameter configurations. To address these issues, we designed and implemented a Hadoop simulator called WaxElephant, which provides the following capabilities: (1) loading real MapReduce workloads derived from the historical logs of Hadoop clusters and replaying the job execution history, (2) synthesizing workloads and executing them based on the statistical characteristics of those workloads, (3) identifying the optimal parameter configurations, and (4) analyzing the scalability of the cluster. Extensive experiments have been conducted to validate the accuracy of the WaxElephant simulator.
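To make the replay and scalability capabilities more concrete, the following is a minimal sketch of trace replay on a simulated cluster. It is not WaxElephant's implementation: the trace format, the function name `replay_makespan`, the FIFO ordering, and the single pool of interchangeable task slots are all assumptions made for illustration. The sketch replays the same workload on clusters of different sizes and reports the completion time of the last task.

```python
import heapq

def replay_makespan(jobs, num_slots):
    """Replay a job trace on a cluster with `num_slots` task slots.

    `jobs` is a hypothetical trace format: a list of
    (submit_time, [task_durations]) tuples. Jobs are served in
    FIFO order of submit time (an assumed scheduling policy).
    Returns the time at which the last task finishes.
    """
    # Each heap entry is the time at which one slot becomes free.
    slots = [0.0] * num_slots
    heapq.heapify(slots)
    finish = 0.0
    for submit_time, durations in sorted(jobs, key=lambda j: j[0]):
        for duration in durations:
            free_at = heapq.heappop(slots)      # earliest available slot
            start = max(free_at, submit_time)   # a task cannot start before its job is submitted
            end = start + duration
            heapq.heappush(slots, end)
            finish = max(finish, end)
    return finish

# Illustrative trace (made-up values, not measured data).
trace = [(0.0, [4.0, 4.0, 4.0, 4.0]), (1.0, [2.0, 2.0])]

if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(n, replay_makespan(trace, n))
```

Sweeping `num_slots` over a fixed trace is the simplest form of scalability analysis; a parameter search could in principle wrap the same function in a loop over candidate configurations and keep the one with the smallest completion time.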
