MapReduce Workload Modeling with Statistical Approach

Large-scale data-intensive cloud computing with the MapReduce framework is becoming pervasive in the core business of many academic, government, and industrial organizations. Hadoop, a state-of-the-art open source project, is by far the most successful realization of the MapReduce framework. While MapReduce is easy to use, efficient, and reliable for data-intensive computations, the large number of configuration parameters in Hadoop makes it unexpectedly difficult to run diverse workloads effectively on a Hadoop cluster. Consequently, developers with little experience of the Hadoop configuration system may devote significant effort to writing an application that still performs poorly, either because they do not know how these configurations influence performance or because they are not even aware that the configurations exist. There is a pressing need for comprehensive analysis and performance modeling to ease MapReduce application development and guide performance optimization under different Hadoop configurations. In this paper, we propose a statistical analysis approach to identify the relationships among workload characteristics, Hadoop configurations, and workload performance. We apply principal component analysis and cluster analysis to 45 different metrics to derive relationships between workload characteristics and the corresponding performance under different Hadoop configurations. We also construct regression models to predict the performance of various workloads under different Hadoop configurations. Our analysis reveals several non-intuitive relationships between workload characteristics and performance, and the experimental results demonstrate that our regression models accurately predict the performance of MapReduce workloads under different Hadoop configurations.
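
The pipeline described in the abstract (dimensionality reduction over 45 metrics, grouping of similar runs, then a regression model for performance) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the data are synthetic random numbers rather than Hadoop measurements, the 90% variance threshold and the choice of three clusters are arbitrary, k-means stands in for whichever clustering method the paper uses, and ordinary least squares stands in for the paper's regression models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT real measurements): 60 runs x 45 metrics,
# generated from three hidden factors plus a little noise.
n_runs, n_metrics = 60, 45
latent = rng.normal(size=(n_runs, 3))
mixing = rng.normal(size=(3, n_metrics))
metrics = latent @ mixing + 0.1 * rng.normal(size=(n_runs, n_metrics))
# Hypothetical "execution time" driven by two of the hidden factors.
runtime = 100 + 20 * latent[:, 0] - 5 * latent[:, 1] + rng.normal(size=n_runs)


def pca(X, var_target=0.9):
    """Standardize X and project it onto the fewest principal components
    whose cumulative explained variance reaches var_target."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    ratio = S**2 / np.sum(S**2)
    k = int(np.searchsorted(np.cumsum(ratio), var_target)) + 1
    k = min(k, len(S))
    return Z @ Vt[:k].T, ratio[:k]


def kmeans(X, k, iters=50):
    """Plain k-means with a deterministic start (first k rows as centers)."""
    centers = X[:k].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels


def fit_linear(X, y):
    """Ordinary least squares with an intercept term."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef


# 1. Reduce the 45 metrics to a handful of principal components.
scores, ratio = pca(metrics)
# 2. Group runs with similar component scores.
labels = kmeans(scores, k=3)
# 3. Fit a regression on 45 runs and evaluate on the 15 held-out runs.
coef = fit_linear(scores[:45], runtime[:45])
pred = predict(coef, scores[45:])
residual = np.sum((runtime[45:] - pred) ** 2)
total = np.sum((runtime[45:] - runtime[45:].mean()) ** 2)
r2 = 1.0 - residual / total

print("components kept:", scores.shape[1])
print("cluster sizes:", np.bincount(labels, minlength=3))
print("held-out R^2:", round(float(r2), 3))
```

With real data, `metrics` would hold one row per workload run and one column per collected metric, and `runtime` would hold the measured performance figure. Regressing on component scores rather than on all 45 raw metrics keeps the model small and avoids fitting strongly correlated inputs separately.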
