Exploiting social network graph characteristics for efficient BFS on heterogeneous chips

Abstract: Several approaches implement efficient BFS algorithms for multicores and for GPUs. However, when targeting heterogeneous architectures, how to distribute the work among the CPU cores and the accelerators is still an open problem. In this paper, we assess several approaches to performing BFS on different heterogeneous chips (a multicore CPU and an integrated GPU). In particular, we propose three heterogeneous approaches that exploit the collaboration between both devices: Selective, Concurrent and Asynchronous. We identify how to take advantage of the features of social network graphs, which are a particular example of highly connected graphs (with fewer iterations and more unbalanced), as well as the drawbacks of each algorithmic implementation. One key feature of our approaches is that they switch between different versions of the algorithm, depending on the device that collaborates in the computation. Through exhaustive evaluation we find that our heterogeneous implementations can be up to 1.56× faster and 1.32× more energy efficient than the best baseline where only one device is used, with an overhead below 10% with respect to an oracle scheduler. We also compare with another related heterogeneous approach, finding that ours can be up to 3.6× faster.
