On exploring efficient shuffle design for in-memory MapReduce

MapReduce is widely used for big data analysis in many fields. Shuffling, the inter-node data exchange phase of MapReduce, has been reported as the major bottleneck of the framework. Acceleration of shuffling has been studied in the literature, and we raise two questions in this paper. The first concerns the effect of Remote Direct Memory Access (RDMA) on shuffle performance. RDMA enables one machine to read and write the local memory of another and is known to be an efficient data transfer mechanism. Does the use of RDMA alone improve the performance of shuffling? The second concerns the choice of data transfer algorithm. Conventional MapReduce implementations use two types of shuffle algorithms: Fully-Connected, and more structured algorithms such as Pairwise. Does the data transfer algorithm affect the performance of shuffling? To answer these questions, we designed and implemented a MapReduce system from scratch in C/C++ to obtain maximum performance and to retain design flexibility. For the first question, we compared RDMA shuffling based on rsocket with shuffling based on IPoIB. Experiments with GroupBy showed that RDMA accelerates the map+shuffle phase by around 50%. For the second question, we first compared our in-memory system with Apache Spark to verify that our system performs more efficiently than an existing system: ours was faster by a factor of 3.04 on Word Count and by a factor of 2.64 on BiGram Count. We then compared the two data exchange algorithms, Fully-Connected and Pairwise. Experiments with BiGram Count showed that Fully-Connected without RDMA was 13% more efficient than Pairwise with RDMA. We conclude that overlapping the map and shuffle phases is necessary to gain a performance improvement.
The relatively small percentage of improvement can be attributed to the time-consuming insertion of key-value pairs into the hash map during the map phase.
