Performance Variations in Resource Scaling for MapReduce Applications on Private and Public Clouds

In this paper, we delineate the causes of performance variations when scaling provisioned virtual resources for a variety of MapReduce applications. Hadoop MapReduce facilitates the development and execution processes of large-scale batch applications on big data. However, provisioning suitable resources to achieve desired performance at an affordable cost requires expertise into the execution model of MapReduce, the resources available for provisioning and the execution behavior of the application at hand. As an initial step towards automating this process, we characterize the difference in execution response for different MapReduce applications while varying the number of virtualized CPUs and memory resources, number of map slots as well as cluster size on a private cloud. This characterization helps illustrate the performance variation, 5x compared to 36x speedup, of Reduce-intensive and Map-intensive applications at effectively utilizing provisioned resources at different scales (1-64 VMs). By comparing the scalability efficiency, we clearly indicate the under-provisioning or over-provisioning of resources for different MapReduce applications at large scale.

[1]  Jianling Sun,et al.  An analytical performance model of MapReduce , 2011, 2011 IEEE International Conference on Cloud Computing and Intelligence Systems.

[2]  Seyong Lee,et al.  PUMA: Purdue MapReduce Benchmarks Suite , 2012 .

[3]  Guanying Wang,et al.  A simulation approach to evaluating design decisions in MapReduce setups , 2009, 2009 IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems.

[4]  Cristina L. Abad,et al.  A storage-centric analysis of MapReduce workloads: File popularity, temporal locality and arrival patterns , 2012, 2012 IEEE International Symposium on Workload Characterization (IISWC).

[5]  Weisong Shi,et al.  Workload characterization on a production Hadoop cluster: A case study on Taobao , 2012, 2012 IEEE International Symposium on Workload Characterization (IISWC).

[6]  Fan Zhang,et al.  Cluster-Size Scaling and MapReduce Execution Times , 2013, 2013 IEEE 5th International Conference on Cloud Computing Technology and Science.

[7]  Tom White,et al.  Hadoop: The Definitive Guide , 2009 .

[8]  Chenyu Wang,et al.  Cross-Phase Optimization in MapReduce , 2013, 2013 IEEE International Conference on Cloud Engineering (IC2E).

[9]  Keke Chen,et al.  Towards Optimal Resource Provisioning for Running MapReduce Programs in Public Clouds , 2011, 2011 IEEE 4th International Conference on Cloud Computing.

[10]  Christoforos E. Kozyrakis,et al.  Phoenix rebirth: Scalable MapReduce on a large-scale shared-memory system , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[11]  Geoffrey C. Fox,et al.  Investigation of Data Locality in MapReduce , 2012, 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012).

[12]  Fan Zhang,et al.  Dataset Scaling and MapReduce Performance , 2013, 2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum.

[13]  Beng Chin Ooi,et al.  The performance of MapReduce , 2010, Proc. VLDB Endow..

[14]  Radu Sion,et al.  Enhancement of Xen's scheduler for MapReduce workloads , 2011, HPDC '11.

[15]  Mohammad Hammoud,et al.  MC2: Map Concurrency Characterization for MapReduce on the Cloud , 2013, 2013 IEEE Sixth International Conference on Cloud Computing.

[16]  Herodotos Herodotou,et al.  Profiling, what-if analysis, and cost-based optimization of MapReduce programs , 2011, Proc. VLDB Endow..

[17]  Lavanya Ramakrishnan,et al.  Benchmarking MapReduce Implementations for Application Usage Scenarios , 2011, 2011 IEEE/ACM 12th International Conference on Grid Computing.

[18]  Wu-chun Feng,et al.  MOON: MapReduce On Opportunistic eNvironments , 2010, HPDC '10.

[19]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[20]  Herodotos Herodotou,et al.  MapReduce programming and cost-based optimization? , 2011, Proc. VLDB Endow..

[21]  Chunjie Luo,et al.  Characterizing data analysis workloads in data centers , 2013, 2013 IEEE International Symposium on Workload Characterization (IISWC).

[22]  Geoffrey C. Fox,et al.  MapReduce for Data Intensive Scientific Analyses , 2008, 2008 IEEE Fourth International Conference on eScience.