End-to-End Optimization for Geo-Distributed MapReduce

MapReduce has proven remarkably effective for a wide variety of data-intensive applications, but it was designed to run on large, single-site, homogeneous clusters. Researchers have begun to explore how far the original MapReduce assumptions can be relaxed to accommodate skewed workloads, iterative applications, and heterogeneous computing environments. This paper continues that exploration by applying MapReduce to geo-distributed data processed on geo-distributed computation resources. Using Hadoop, we show that network and node heterogeneity and the lack of data locality lead to poor performance, because the interaction between MapReduce phases becomes pronounced in the presence of heterogeneous network behavior. To address these problems, we take a two-pronged approach. We first develop a model-driven optimization that serves as an oracle, providing high-level insights. We then apply these insights to design cross-phase optimization techniques, which we implement and demonstrate in a real-world MapReduce implementation. Experimental results on both Amazon EC2 and PlanetLab show the potential of these techniques: performance improves by 7-18 percent, depending on the execution environment and application.
