Performance Evaluation and Estimation Model Using Regression Method for Hadoop WordCount

Given the rapid growth in cloud computing, it is important to analyze the performance of different Hadoop MapReduce applications and to understand the bottlenecks in a cloud cluster that limit performance. It is also important to analyze the underlying hardware in cloud cluster servers so that software and hardware can be optimized to achieve the maximum possible performance. Hadoop is based on MapReduce, one of the most popular programming models for big data analysis in a parallel computing environment. In this paper, we present a detailed performance analysis, characterization, and evaluation of the Hadoop MapReduce WordCount application. We also propose an estimation model, based on an Amdahl's law regression method, that estimates performance and total processing time for different input sizes on a given processor architecture. The regression model is verified to estimate performance and run time with an error margin of <5%.
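As a rough illustration of the idea of estimating run time from input size by regression, the following minimal sketch fits a straight line to measured (input size, run time) pairs and extrapolates to a larger input. It is not the paper's model: the linear form, the function names, and the sample numbers are all assumptions made for illustration. Read loosely in the spirit of Amdahl's law, the intercept stands for the part of the run time that does not scale with input size and the slope for the part that does.

```python
# Minimal sketch (assumed linear model, hypothetical names and data):
# run_time ~= fixed_overhead + per_gb_cost * input_size_gb

def fit_line(sizes, times):
    """Ordinary least-squares fit; returns (intercept, slope)."""
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(times) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, times))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def estimate_runtime(intercept, slope, size):
    """Predict run time for a given input size from the fitted line."""
    return intercept + slope * size

def relative_error(predicted, measured):
    """Absolute relative error of a prediction against a measurement."""
    return abs(predicted - measured) / measured

# Synthetic, made-up measurements (GB, seconds), not results from the paper.
SIZES_GB = [1, 2, 4, 8]
TIMES_S = [50, 80, 140, 260]

if __name__ == "__main__":
    fixed_overhead, per_gb_cost = fit_line(SIZES_GB, TIMES_S)
    predicted = estimate_runtime(fixed_overhead, per_gb_cost, 16)
    print(f"fixed overhead: {fixed_overhead:.1f} s")
    print(f"cost per GB:    {per_gb_cost:.1f} s/GB")
    print(f"predicted run time at 16 GB: {predicted:.1f} s")
```

To check such a model against an error margin, one would hold back a measured data point, predict it with `estimate_runtime`, and compare the two using `relative_error`.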
