Characterizing and benchmarking stand-alone Hadoop MapReduce on modern HPC clusters

With the emergence of high-performance data analytics, the Hadoop platform is increasingly used to process data stored on high-performance computing (HPC) clusters. These modern clusters are equipped with high-speed interconnects such as InfiniBand and 10/40 GigE, and with storage systems such as SSDs and Lustre. While there is immense scope for improving the performance of Hadoop MapReduce on such clusters, including its network-intensive shuffle phase, it is essential to study the MapReduce component in isolation. In this paper, we study popular MapReduce workloads, drawn from well-accepted, comprehensive benchmark suites, to identify common shuffle data distribution patterns. We determine the environmental and workload-specific factors that affect the performance of a MapReduce job. Based on these characterization studies, we propose a micro-benchmark suite for evaluating the performance of stand-alone Hadoop MapReduce, and demonstrate its ease of use with different networks and protocols, Hadoop distributions, and storage architectures. Performance evaluations with the proposed micro-benchmarks show that stand-alone Hadoop MapReduce over IPoIB outperforms 10 GigE by about 13–15%, and that the RDMA-enhanced hybrid MapReduce design achieves up to 43% performance improvement over default Hadoop MapReduce over IPoIB, in both shared-nothing and shared storage architectures.
