Analyzing Cost Parameters Affecting Map Reduce Application Performance

Recently, big data analysis has become an imperative task for many big companies. Map-Reduce, an emerging distributed computing paradigm, is known as a promising architecture for big data analytics on commodity hardware. Map-Reduce, and its open source implementation Hadoop, have been extensively accepted by several companies due to their salient features such as scalability, elasticity, fault-tolerance and flexibility to handle big data. However, these benefits entail a considerable performance sacrifice. The performance of a Map-Reduce application depends on various factors including the size of the input data set, cluster resource settings etc. A clear understanding of the factors that affect Map-Reduce application performance and the cost associated with those factors is required. In this paper, we study different performance parameters and an existing Cost Optimizer that computes the cost of Map-Reduce job execution. The cost based optimizer also considers various configuration parameters available in Hadoop that affect performance of these programs. This paper is an attempt to analyze the Map-Reduce application performance and identifying the key factors affecting the cost and performance of executing Map-Reduce applications.

[1]  Shivnath Babu,et al.  Towards automatic optimization of MapReduce programs , 2010, SoCC '10.

[2]  S. Taruna,et al.  Efficient data layouts for cost-optimized Map-Reduce operations , 2015, 2015 2nd International Conference on Computing for Sustainable Global Development (INDIACom).

[3]  Herodotos Herodotou,et al.  Profiling, what-if analysis, and cost-based optimization of MapReduce programs , 2011, Proc. VLDB Endow..

[4]  Christos Doulkeridis,et al.  A survey of large-scale analytical query processing in MapReduce , 2013, The VLDB Journal.

[5]  Herodotos Herodotou,et al.  MapReduce programming and cost-based optimization? , 2011, Proc. VLDB Endow..

[6]  Kenneth Wottrich The Performance Characteristics of MapReduce Applications on Scalable Clusters , 2011 .

[7]  Jorge-Arnulfo Quiané-Ruiz,et al.  Efficient Big Data Processing in Hadoop MapReduce , 2012, Proc. VLDB Endow..

[8]  Rong Gu,et al.  SHadoop: Improving MapReduce performance by optimizing job execution mechanism in Hadoop clusters , 2014, J. Parallel Distributed Comput..

[9]  Shicong Meng,et al.  Improving ReduceTask data locality for sequential MapReduce jobs , 2013, 2013 Proceedings IEEE INFOCOM.

[10]  Murali S. Kodialam,et al.  Scheduling in mapreduce-like systems for fast completion time , 2011, 2011 Proceedings IEEE INFOCOM.

[11]  Jignesh M. Patel,et al.  Column-Oriented Storage Techniques for MapReduce , 2011, Proc. VLDB Endow..

[12]  Liang Dong,et al.  Starfish: A Self-tuning System for Big Data Analytics , 2011, CIDR.