Dataset Scaling and MapReduce Performance

Predicting how the execution time of a MapReduce application changes as its input dataset grows is a challenging problem. The difficulty lies in the distributed placement of the input data and in the distributed, virtualized compute resources, which use different network substrates. The potential payoff is the ability to use small datasets and a limited number of test runs to understand how applications will behave with "big data." We develop an in-depth understanding of MapReduce application performance and analyze the impact of scaling the input dataset. Although "embarrassingly parallel" MapReduce jobs might be expected to scale linearly with input dataset size, our results show that execution time sometimes increases nonlinearly. To verify our predictions, we identify a benchmark set of Map-, Shuffle-, and Reduce-intensive applications. Experimental results show that our execution-time analysis distinguishes four typical application behaviors when scaling input datasets.
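To make the linear-versus-nonlinear distinction concrete, one simple check is to fit a power law, time ≈ c · size^k, to a few small test runs and inspect the exponent k: a value near 1 suggests linear scaling, and a value clearly above 1 suggests superlinear growth. The sketch below is only an illustration of that idea and is not the analysis used in this work; the function name `scaling_exponent`, the 1.05 threshold, and the measurements are all hypothetical.

```python
import math


def scaling_exponent(sizes, times):
    """Return the least-squares slope of log(time) against log(size).

    Under the assumed model time = c * size**k, this slope estimates k.
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(t) for t in times]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Hypothetical data, generated from time = 2 * size**1.5 for illustration only;
# these are not measurements from any real MapReduce job.
sizes_gb = [1, 2, 4, 8]
times_s = [2 * s ** 1.5 for s in sizes_gb]

exponent = scaling_exponent(sizes_gb, times_s)
# The 1.05 cutoff is an arbitrary tolerance chosen for this sketch.
is_superlinear = exponent > 1.05
```

A single power-law fit is a coarse diagnostic: it cannot by itself separate Map-, Shuffle-, and Reduce-phase effects, so it would at most flag jobs whose scaling deserves a closer look.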
