Optimizing communication for a 2D-partitioned scalable BFS

Recent research has investigated partitioning, acceleration, and data-reduction techniques for improving the performance of Breadth-First Search (BFS) and the related HPC benchmark, Graph500. However, few implementations have targeted cloud-based systems such as Amazon Web Services, which differ from HPC systems in several ways, most importantly in network interconnect. This work examines optimizations that reduce the communication overhead of an accelerated, distributed BFS on both an HPC system and a smaller, cloud-like GPU cluster. We demonstrate the effects of an efficient 2D partitioning scheme and allreduce implementation, as well as of several CPU-based compression schemes that reduce the overall amount of data exchanged between nodes. Timing and Score-P profiling results show a dramatic reduction in row and column frontier-queue data (up to 91%) and demonstrate how compression can improve performance on a bandwidth-limited cluster.
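To give a sense of the kind of CPU-based frontier-queue compression evaluated in this line of work, the sketch below shows a generic delta + variable-byte (VByte) codec over a sorted list of vertex IDs. This is an illustrative assumption, not the paper's actual codec: because BFS frontiers on scale-free graphs tend to contain many nearby vertex IDs, the gaps between sorted IDs are small and most encode in a single byte instead of a full 4- or 8-byte integer, which is the effect behind large reductions in transmitted frontier data.

```python
def vbyte_encode(sorted_ids):
    """Delta-encode a sorted vertex-ID frontier, then pack each gap
    as variable-byte: 7 data bits per byte, high bit = continuation."""
    out = bytearray()
    prev = 0
    for v in sorted_ids:
        gap = v - prev
        prev = v
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)  # more bytes follow
            gap >>= 7
        out.append(gap)                      # final byte, high bit clear
    return bytes(out)

def vbyte_decode(data):
    """Invert vbyte_encode: reassemble gaps, then prefix-sum back to IDs."""
    ids, cur, shift, prev = [], 0, 0, 0
    for b in data:
        cur |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7                       # continuation byte
        else:
            prev += cur                      # gap complete; undo delta
            ids.append(prev)
            cur, shift = 0, 0
    return ids
```

A frontier of n vertices that would otherwise ship as n fixed-width integers shrinks to roughly one byte per small gap; the same idea underlies the SIMD-accelerated VByte variants in the compression literature this work draws on.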
