Accelerating Iterative Big Data Computing Through MPI

The popular systems Hadoop and Spark do not achieve satisfactory performance on iterative big data applications because they overlap computation and communication inefficiently. The pipeline of computing, data movement, and data management plays a key role in current distributed data computing systems. In this paper, we first analyze the overhead of the shuffle operation in Hadoop and Spark when running the PageRank workload, and then propose DataMPI-Iteration, an MPI-based library for iterative big data computing that uses an event-driven pipeline and an in-memory shuffle design to better overlap computation and communication. Our performance evaluation shows that DataMPI-Iteration achieves a 9X∼21X speedup over Apache Hadoop and a 2X∼3X speedup over Apache Spark for PageRank and K-means.
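The general idea of overlapping computation with an in-memory shuffle can be illustrated with a small single-process sketch. This is not the DataMPI-Iteration implementation and does not use MPI; the function `pipelined_shuffle`, its parameters, and the toy edge list are hypothetical names introduced only for illustration. A map stage emits key-value pairs in small chunks, and a separate thread partitions each chunk in memory as soon as it arrives, rather than waiting for the whole map phase to finish.

```python
import queue
import threading
from collections import defaultdict


def pipelined_shuffle(records, map_fn, num_partitions=2, chunk_size=2):
    """Map records to (key, value) pairs and shuffle them in memory.

    Illustrative sketch only: the shuffle thread runs concurrently with
    the map loop, so partitioning overlaps with computation.
    Keys are assumed to be non-negative integers.
    """
    chunks = queue.Queue(maxsize=4)  # bounded buffer between the two stages
    partitions = [defaultdict(list) for _ in range(num_partitions)]

    def shuffle_worker():
        # React to each arriving chunk (the "event") immediately.
        while True:
            chunk = chunks.get()
            if chunk is None:  # sentinel: the map stage has finished
                break
            for key, value in chunk:
                partitions[key % num_partitions][key].append(value)

    worker = threading.Thread(target=shuffle_worker)
    worker.start()

    # Map stage: hand over results chunk by chunk instead of all at once.
    chunk = []
    for record in records:
        chunk.append(map_fn(record))
        if len(chunk) == chunk_size:
            chunks.put(chunk)
            chunk = []
    if chunk:
        chunks.put(chunk)
    chunks.put(None)
    worker.join()
    return [dict(p) for p in partitions]


# Toy PageRank-style input: directed edges (source page, destination page).
edges = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 2)]
# Emit (destination, source) so each page collects its in-neighbours.
result = pipelined_shuffle(edges, lambda e: (e[1], e[0]))
print(result)
```

In a distributed setting the queue would be replaced by network transfers between processes; the sketch only shows the pipelining pattern, not its performance characteristics.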
