Evaluation and Optimization of Breadth-First Search on NUMA Cluster

Graphs are widely used in many areas. Breadth-First Search (BFS), a key subroutine of many graph analysis algorithms, has become the primary benchmark for the Graph500 ranking. Because of the high communication cost of BFS, multi-socket NUMA nodes with large memory capacity are expected to reduce network pressure. However, the longer latency of remote memory accesses can hurt performance if it is not handled carefully. In this work, we first demonstrate that simply spawning one MPI process per socket and binding it there achieves the best performance for an MPI/OpenMP hybrid BFS, yielding a 1.53X speedup on 16 nodes. Nevertheless, running one MPI process per socket can exacerbate the communication cost. We therefore propose to share some of the communication data structures among the processes within a node, eliminating most of the intra-node communication. To fully utilize the network bandwidth, we let all processes in a node perform communication simultaneously. We further adjust the granularity of a key bitmap to improve cache locality and speed up the computation. With the NUMA, communication, and computation optimizations combined, we achieve a 2.44X speedup on 16 nodes, corresponding to 39.2 billion traversed edges per second (GTEPS) for an R-MAT graph of scale 32 (4 billion vertices and 64 billion edges).
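
The following is a minimal sketch, not the authors' implementation, of how a communication data structure such as a frontier bitmap could be shared among the MPI processes of one node using an MPI-3 shared-memory window, assuming one process has already been bound to each socket. The size constant BITMAP_WORDS and the program name are illustrative placeholders.

```c
/* Sketch: per-node shared bitmap via an MPI-3 shared-memory window,
 * assuming one MPI process is bound to each socket of the node. */
#include <mpi.h>
#include <stdint.h>

#define BITMAP_WORDS (1 << 20)   /* hypothetical size of the per-node bitmap */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Group the processes that share a node. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);

    int node_rank;
    MPI_Comm_rank(node_comm, &node_rank);

    /* Rank 0 of the node allocates the shared bitmap; the others attach. */
    MPI_Win win;
    uint64_t *bitmap;
    MPI_Aint size = (node_rank == 0) ? BITMAP_WORDS * sizeof(uint64_t) : 0;
    MPI_Win_allocate_shared(size, sizeof(uint64_t), MPI_INFO_NULL,
                            node_comm, &bitmap, &win);

    if (node_rank != 0) {
        /* Query the base address of rank 0's segment so every process
         * in the node sees the same bitmap. */
        MPI_Aint qsize;
        int disp;
        MPI_Win_shared_query(win, 0, &qsize, &disp, &bitmap);
    }

    /* ... BFS iterations would set and test bits in `bitmap` here, so
     * frontier updates need no intra-node MPI messages ... */

    MPI_Win_free(&win);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```

For the one-process-per-socket placement itself, a launcher option such as Open MPI's `mpirun --map-by socket --bind-to socket` is one way to obtain the binding assumed above; the paper's exact launch configuration is not specified here.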
