High-Performance Graph Analytics on Manycore Processors

Divergence in the computer architecture landscape has led to several different architectures being considered mainstream at the same time. This poses a dilemma for application and algorithm developers, who must exploit the underlying architectural features to extract the best performance on each of these architectures while still writing portable code. We address this problem with graph analytics as our target application domain. In this paper, we present an abstraction-based methodology for performance-portable graph algorithm design on manycore architectures. We demonstrate our approach by systematically optimizing algorithms for the problems of breadth-first search, color propagation, and strongly connected components. We use Kokkos, a manycore library and programming model, for prototyping our algorithms. Our portable implementation of the strongly connected components algorithm on the NVIDIA Tesla K40M is up to 3.25× faster than a state-of-the-art parallel CPU implementation on a dual-socket Sandy Bridge compute node.
