Scalable Graph Traversal on Sunway TaihuLight with Ten Million Cores

Interest has recently grown in efficiently analyzing unstructured data such as social network graphs and protein structures. A fundamental graph algorithm for such tasks is Breadth-First Search (BFS), the foundation for many other important graph algorithms, such as computing shortest paths or finding the maximum flow in a graph. In this paper, we share our experience of designing and implementing BFS on Sunway TaihuLight, a newly released machine with 40,960 nodes and 10.6 million accelerator cores. It tops the Top500 list of June 2016 with a Linpack performance of 93.01 petaflops [1]. Designed for extremely large-scale computation and power efficiency, the processors of Sunway TaihuLight employ a unique heterogeneous many-core architecture and memory hierarchy. With its extremely large size, the machine presents both opportunities and challenges for implementing high-performance irregular algorithms such as BFS. We propose several techniques, including pipelined module mapping, contention-free data shuffling, and group-based message batching, to address the challenges of efficiently utilizing the features of this large-scale heterogeneous machine. We ultimately achieved 23,755.7 giga-traversed edges per second (GTEPS), the best result among heterogeneous machines and the second overall on the Graph500 June 2016 list [2].
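For readers unfamiliar with the algorithm and the metric named in the abstract, the following is a minimal single-node sketch of a level-synchronous BFS together with a GTEPS conversion. It is illustrative only and is not the implementation described in the paper: the function names `bfs_levels` and `gteps`, the adjacency-dict graph representation, and the sample graph are all assumptions made for this example, and the Graph500 benchmark defines its own rules for which edges are counted.

```python
def bfs_levels(adj, root):
    """Level-synchronous BFS over an adjacency dict (hypothetical helper).

    Returns (parent, depth) maps for every vertex reachable from root.
    """
    parent = {root: root}
    depth = {root: 0}
    frontier = [root]
    level = 0
    while frontier:
        level += 1
        next_frontier = []
        # Expand the whole current frontier before moving to the next level.
        for u in frontier:
            for v in adj[u]:
                if v not in parent:  # first visit decides the parent
                    parent[v] = u
                    depth[v] = level
                    next_frontier.append(v)
        frontier = next_frontier
    return parent, depth


def gteps(traversed_edges, seconds):
    """Convert an edge count and a wall time into giga-traversed edges per second."""
    return traversed_edges / seconds / 1e9


# Small undirected sample graph (assumed for illustration); vertex 4 is isolated.
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2], 4: []}
parent, depth = bfs_levels(adj, 0)
```

In this sketch the frontier is processed one level at a time, which is the structure that distributed BFS implementations typically parallelize; the techniques listed in the abstract concern how that work is mapped to the machine, which this example does not attempt to model.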

[1] Daisuke Takahashi, et al. Efficient Hybrid Breadth-First Search on GPUs, 2013, ICA3PP.

[2] Pradeep Dubey, et al. Fast and Efficient Graph Traversal Algorithm for CPUs: Maximizing Single-Node Efficiency, 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.

[3] Fabio Checconi, et al. Breaking the speed and scalability Barriers for Graph exploration on distributed-memory machines, 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.

[4] Kunle Olukotun, et al. Efficient Parallel Graph Exploration on Multi-Core CPU and GPU, 2011, 2011 International Conference on Parallel Architectures and Compilation Techniques.

[5] Kunle Olukotun, et al. Accelerating CUDA graph algorithms at maximum warp, 2011, PPoPP '11.

[6] John D. Owens, et al. Gunrock: a high-performance graph processing library on the GPU, 2015, PPoPP.

[7] Yu Wang, et al. A Reconfigurable Computing Approach for Efficient and Scalable Parallel Graph Exploration, 2012, 2012 IEEE 23rd International Conference on Application-Specific Systems, Architectures and Processors.

[8] Martin D. F. Wong, et al. An effective GPU implementation of breadth-first search, 2010, Design Automation Conference.

[9] Edmond Chow, et al. A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L, 2005, ACM/IEEE SC 2005 Conference (SC'05).

[10] Nancy M. Amato, et al. Faster Parallel Traversal of Scale Free Graphs at Extreme Scale with Vertex Delegates, 2014, SC14: International Conference for High Performance Computing, Networking, Storage and Analysis.

[11] Kamesh Madduri, et al. Parallel breadth-first search on distributed memory systems, 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[12] Nancy M. Amato, et al. Scaling Techniques for Massive Scale-Free Graphs in Distributed (External) Memory, 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.

[13] Massimo Bernaschi, et al. Parallel Distributed Breadth First Search on the Kepler Architecture, 2016, IEEE Transactions on Parallel and Distributed Systems.

[14] Jack Dongarra, et al. Report on the Sunway TaihuLight System, 2016.

[15] Fabrizio Petrini, et al. Efficient Breadth-First Search on the Cell/BE Processor, 2008, IEEE Transactions on Parallel and Distributed Systems.

[16] Katsuki Fujisawa, et al. Fast and scalable NUMA-based thread parallel breadth-first search, 2015, 2015 International Conference on High Performance Computing & Simulation (HPCS).

[17] H. Howie Huang, et al. Enterprise: breadth-first graph traversal on GPUs, 2015, SC15: International Conference for High Performance Computing, Networking, Storage and Analysis.

[18] David A. Patterson, et al. Distributed Memory Breadth-First Search Revisited: Enabling Bottom-Up Search, 2013, 2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum.

[19] Fabio Checconi, et al. Traversing Trillions of Edges in Real Time: Graph Exploration on Large-Scale Parallel Machines, 2014, 2014 IEEE 28th International Parallel and Distributed Processing Symposium.

[20] Richard E. Korf, et al. Large-Scale Parallel Breadth-First Search, 2005, AAAI.

[21] Koji Ueno, et al. Parallel distributed breadth first search on GPU, 2013, 20th Annual International Conference on High Performance Computing.

[22] David A. Patterson, et al. Distributed-Memory Breadth-First Search on Massive Graphs, 2017, ArXiv.

[23] Wei Ge, et al. The Sunway TaihuLight supercomputer: system and applications, 2016, Science China Information Sciences.

[24] Andrew S. Grimshaw, et al. Scalable GPU graph traversal, 2012, PPoPP '12.

[25] David A. Bader, et al. Scalable Graph Exploration on Multicore Processors, 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.

[26] David A. Patterson, et al. Direction-optimizing breadth-first search, 2012, HiPC 2012.