MRA++: Scheduling and data placement on MapReduce for heterogeneous environments

MapReduce has emerged as a popular programming model for data-intensive computing. Its popularity stems from its simple design, which makes it easy for programmers to use, and from framework implementations such as Hadoop, which have been adopted by large business and technology companies. In this paper we improve the Hadoop MapReduce framework by introducing algorithms suited to heterogeneous environments, with the goal of performing data-intensive computing efficiently in such environments. These adaptations are needed because Hadoop, following the framework design proposed by Google, is optimized to run in large homogeneous clusters. We therefore propose MRA++, a new MapReduce framework design that considers the heterogeneity of nodes during data distribution, task scheduling and job control. MRA++ runs a training task to gather information before the data is distributed. Although this adds a delay to the setup phase, we show that the delay is offset by the effectiveness of the mechanisms and algorithms, which achieve performance gains of more than 70% in 10 Mbps networks.
