A Tuned, Concurrent-Kernel Approach to Speed Up the APSP Problem

The All-Pair Shortest-Path (APSP) problem is a well-known problem in graph theory whose objective is to find the shortest paths between every pair of nodes. For sparse graphs, an adequate solution is to compute the distances from one source node to all the others and to repeat this process for every node of the graph. In recent years, GPU devices have been increasingly used to accelerate this kind of problem. While a correct NVIDIA CUDA implementation of this algorithm is easy to achieve, exploiting the GPU capabilities to obtain good performance is a task for experienced CUDA programmers. A typical code-tuning strategy is the selection of an appropriate thread-block size. In addition, the concurrent deployment of several kernels that compute distances from different sources also reduces execution times. In this paper we show that an adequate combination of both strategies yields an 11.5% performance improvement compared with other recommended CUDA configurations for the most costly kernel of the APSP problem.
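
The repeated single-source scheme described above can be illustrated with a minimal sequential sketch. It is not the CUDA implementation evaluated in the paper: it assumes Dijkstra's algorithm as the single-source routine, non-negative edge weights, and a hypothetical adjacency-list dictionary `graph`; the names `dijkstra` and `apsp` are chosen here for illustration only.

```python
import heapq


def dijkstra(adj, source):
    """Return the shortest distance from `source` to every node in `adj`.

    `adj` maps each node to a list of (neighbor, weight) pairs.
    Edge weights are assumed to be non-negative.
    """
    dist = {node: float("inf") for node in adj}
    dist[source] = 0
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale queue entry; a shorter path was already found
        for v, w in adj[u]:
            candidate = d + w
            if candidate < dist[v]:
                dist[v] = candidate
                heapq.heappush(heap, (candidate, v))
    return dist


def apsp(adj):
    """Solve APSP by running one single-source computation per node."""
    return {source: dijkstra(adj, source) for source in adj}


# Small hypothetical directed graph used only for illustration.
graph = {
    "a": [("b", 1), ("c", 4)],
    "b": [("c", 2)],
    "c": [("a", 1)],
}
all_dist = apsp(graph)
```

Each call to `dijkstra` reads the graph but writes only its own distance table, so the computations for different sources are independent of one another. That independence is the property that makes it possible to run several single-source computations concurrently.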
