Extreme scale breadth-first search on supercomputers

Breadth-First Search (BFS) is one of the most fundamental graph algorithms and is used as a component of many other graph algorithms. Our new method for distributed parallel BFS can compute BFS on a graph with one trillion vertices within half a second using large supercomputers such as the K-Computer. Using our proposed algorithm, the K-Computer was ranked 1st in Graph500 in June and November 2015 and in June 2016, using all of its 82,944 available nodes, with 38,621.4 GTEPS. Based on the hybrid-BFS algorithm by Beamer [3], we devise sets of optimizations for scaling to an extreme number of nodes, including a new efficient graph data structure and optimization techniques such as vertex reordering and load balancing. Performance evaluation on the K-Computer shows that our new BFS is 3.19 times faster on 30,720 nodes than the base version using the previously-known best techniques.
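To make the hybrid (direction-optimizing) BFS idea concrete, the following is a minimal single-process sketch in Python. It is not the distributed implementation described here: it assumes an undirected graph stored as symmetric adjacency lists, and the function name `hybrid_bfs` and the switching thresholds `alpha` and `beta` are illustrative assumptions. Each level is expanded either top-down (frontier vertices scan their neighbors) or bottom-up (unvisited vertices look for any neighbor in the frontier), depending on how large the frontier is.

```python
from typing import List


def hybrid_bfs(adj: List[List[int]], source: int,
               alpha: float = 14.0, beta: float = 24.0) -> List[int]:
    """Return the BFS level of every vertex (-1 if unreachable).

    adj must be symmetric (undirected graph); alpha and beta are
    illustrative tuning thresholds, not values taken from this paper.
    """
    n = len(adj)
    level = [-1] * n
    level[source] = 0
    frontier = [source]
    unexplored_edges = sum(len(neighbors) for neighbors in adj)
    bottom_up = False
    depth = 0

    while frontier:
        frontier_edges = sum(len(adj[v]) for v in frontier)
        # Switch direction based on the size of the current frontier.
        if not bottom_up and frontier_edges > unexplored_edges / alpha:
            bottom_up = True
        elif bottom_up and len(frontier) < n / beta:
            bottom_up = False
        unexplored_edges -= frontier_edges

        depth += 1
        next_frontier = []
        if bottom_up:
            # Bottom-up: each unvisited vertex searches for a parent in
            # the frontier and stops at the first one it finds.
            in_frontier = [False] * n
            for v in frontier:
                in_frontier[v] = True
            for v in range(n):
                if level[v] == -1:
                    for u in adj[v]:
                        if in_frontier[u]:
                            level[v] = depth
                            next_frontier.append(v)
                            break
        else:
            # Top-down: each frontier vertex visits its unvisited neighbors.
            for u in frontier:
                for v in adj[u]:
                    if level[v] == -1:
                        level[v] = depth
                        next_frontier.append(v)
        frontier = next_frontier

    return level
```

Both directions produce the same levels; the bottom-up step simply avoids scanning most edges when the frontier covers a large fraction of the graph, which is where the early exit from the inner loop pays off.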

[1] Koji Ueno, et al. Parallel distributed breadth first search on GPU, 2013, 20th Annual International Conference on High Performance Computing.

[2] Koji Ueno, et al. Highly scalable graph search for the Graph500 benchmark, 2012, HPDC '12.

[3] David A. Patterson, et al. Direction-optimizing breadth-first search, 2012, HiPC 2012.

[4] Christos Faloutsos, et al. Realistic, Mathematically Tractable Graph Generation and Evolution, Using Kronecker Multiplication, 2005, PKDD.

[5] Edmond Chow, et al. A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L, 2005, ACM/IEEE SC 2005 Conference (SC'05).

[6] Katsuki Fujisawa, et al. Fast and Energy-efficient Breadth-First Search on a Single NUMA System, 2014, ISC.

[7] David A. Patterson, et al. Distributed Memory Breadth-First Search Revisited: Enabling Bottom-Up Search, 2013, 2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum.

[8] Fabio Checconi, et al. Traversing Trillions of Edges in Real Time: Graph Exploration on Large-Scale Parallel Machines, 2014, 2014 IEEE 28th International Parallel and Distributed Processing Symposium.

[9] Pradeep Dubey, et al. Large-scale energy-efficient graph traversal: A path to efficient data-intensive supercomputing, 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.

[10] John R. Gilbert, et al. On the representation and multiplication of hypersparse matrices, 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.

[11] Eurípides Montagne, et al. An optimal storage format for sparse matrices, 2004, Inf. Process. Lett.

[12] Tomohiro Inoue, et al. The Tofu Interconnect, 2011, 2011 IEEE 19th Annual Symposium on High Performance Interconnects.

[13] Fabio Checconi, et al. Breaking the speed and scalability Barriers for Graph exploration on distributed-memory machines, 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.

[14] David A. Patterson, et al. Direction-optimizing Breadth-First Search, 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.

[15] Satoshi Matsuoka, et al. Performance characteristics of Graph500 on large-scale distributed environment, 2011, 2011 IEEE International Symposium on Workload Characterization (IISWC).

[16] Kamesh Madduri, et al. Parallel breadth-first search on distributed memory systems, 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[17] Tomohiro Inoue, et al. The Tofu Interconnect, 2012, IEEE Micro.

[18] Fumiyoshi Shoji, et al. The K computer: Japanese next-generation supercomputer development project, 2011, IEEE/ACM International Symposium on Low Power Electronics and Design.