Virtual Shuffling for Efficient Data Movement in MapReduce

MapReduce is a popular parallel processing framework for large-scale data analytics. To keep up with the increasing volume of datasets, it requires efficient I/O capability from the underlying computer systems to process and analyze data in two phases (mapping and reducing). Between these phases, MapReduce requires a shuffling phase to globally exchange the intermediate data generated by the mapping phase. We reveal that data shuffling, by physically moving segments of intermediate data across disks, causes significant I/O contention and compounds the I/O problem. In this paper, we propose a novel virtual shuffling strategy to enable efficient data movement and reduce I/O for MapReduce shuffling, thereby reducing power consumption and conserving energy. Virtual shuffling is realized through a combination of three techniques including a three-level segment table, near-demand merging, and dynamic and balanced merging subtrees. Our experimental results show that virtual shuffling significantly speeds up data movement in MapReduce and achieves faster job execution. Particularly, its reduction in disk I/O accesses results in as much as 12% savings in power consumption for MapReduce programs.

[1]  Archana Ganapathi,et al.  To compress or not to compress - compute vs. IO tradeoffs for mapreduce energy efficiency , 2010, Green Networking '10.

[2]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[3]  Herodotos Herodotou,et al.  Profiling, what-if analysis, and cost-based optimization of MapReduce programs , 2011, Proc. VLDB Endow..

[4]  Shin Gyu Kim,et al.  Harnessing input redundancy in a MapReduce framework , 2010, SAC '10.

[5]  Albert G. Greenberg,et al.  Reining in the Outliers in Map-Reduce Clusters using Mantri , 2010, OSDI.

[6]  Jie Huang,et al.  HiTune: Dataflow-Based Performance Analysis for Big Data Cloud , 2011, USENIX Annual Technical Conference.

[7]  Weikuan Yu,et al.  Hadoop acceleration through network levitated merge , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[8]  Ling Liu,et al.  Purlieus: Locality-aware resource allocation for MapReduce in a cloud , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[9]  Shivnath Babu,et al.  Towards automatic optimization of MapReduce programs , 2010, SoCC '10.

[10]  Scott Shenker,et al.  Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling , 2010, EuroSys '10.

[11]  Beng Chin Ooi,et al.  The performance of MapReduce , 2010, Proc. VLDB Endow..

[12]  Xiaoqiao Meng,et al.  Delay tails in MapReduce scheduling , 2012, SIGMETRICS '12.

[13]  Joseph M. Hellerstein,et al.  MapReduce Online , 2010, NSDI.

[14]  Zheng Shao,et al.  Hive - a petabyte scale data warehouse using Hadoop , 2010, 2010 IEEE 26th International Conference on Data Engineering (ICDE 2010).

[15]  Christoforos E. Kozyrakis,et al.  On the energy (in)efficiency of Hadoop clusters , 2010, OPSR.

[16]  S. Habib,et al.  Introducing map-reduce to high end computing , 2008, 2008 3rd Petascale Data Storage Workshop.

[17]  Randy H. Katz,et al.  Improving MapReduce Performance in Heterogeneous Environments , 2008, OSDI.

[18]  Archana Ganapathi,et al.  Statistical Workloads for Energy Efficient MapReduce , 2010 .

[19]  Michael J. Franklin,et al.  Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing , 2012, NSDI.

[20]  Xin Yang,et al.  Mammoth: autonomic data processing framework for scientific state-transition applications , 2013, CAC.

[21]  Albert G. Greenberg,et al.  Scarlett: coping with skewed content popularity in mapreduce clusters , 2011, EuroSys '11.

[22]  Christopher Ré,et al.  Automatic Optimization for MapReduce Programs , 2011, Proc. VLDB Endow..

[23]  Jignesh M. Patel,et al.  Energy management for MapReduce clusters , 2010, Proc. VLDB Endow..