On the Efficacy of APUs for Heterogeneous Graph Computation

Accelerated Processing Units (APUs) are central processor s that feature integrated GPU cores. In this study, we show that thi s architecture is well-suited to the domain of graph analysis. O ur evaluation shows that a current-generation integrated GPU can o utperform an externally-connected discrete GPU by up to 50% for the breadth-first search and PageRank algorithms. Furthermore , by operating on data with different characteristics in unison, t he CPU and integrated GPU can halve the running time of PageRank on a scale-free dataset.

[1]  P. Erdos,et al.  On the evolution of random graphs , 1984 .

[2]  Leslie G. Valiant,et al.  A bridging model for parallel computation , 1990, CACM.

[3]  Vipin Kumar,et al.  A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs , 1998, SIAM J. Sci. Comput..

[4]  Rajeev Motwani,et al.  The PageRank Citation Ranking : Bringing Order to the Web , 1999, WWW 1999.

[5]  Christos Faloutsos,et al.  R-MAT: A Recursive Model for Graph Mining , 2004, SDM.

[6]  David R. Kaeli,et al.  Exploring the multiple-GPU design space , 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.

[7]  Cédric Augonnet,et al.  StarPU: a unified platform for task scheduling on heterogeneous multicore architectures , 2011, Concurr. Comput. Pract. Exp..

[8]  Aart J. C. Bik,et al.  Pregel: a system for large-scale graph processing , 2010, SIGMOD Conference.

[9]  David A. Bader,et al.  Scalable Graph Exploration on Multicore Processors , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.

[10]  Martin D. F. Wong,et al.  An effective GPU implementation of breadth-first search , 2010, Design Automation Conference.

[11]  Brian W. Barrett,et al.  Introducing the Graph 500 , 2010 .

[12]  Kunle Olukotun,et al.  Accelerating CUDA graph algorithms at maximum warp , 2011, PPoPP '11.

[13]  Jeffrey S. Vetter,et al.  Quantifying NUMA and contention effects in multi-GPU systems , 2011, GPGPU-4.

[14]  Jungwon Kim,et al.  Achieving a single compute device image in OpenCL for multiple GPUs , 2011, PPoPP '11.

[15]  Kunle Olukotun,et al.  Efficient Parallel Graph Exploration on Multi-Core CPU and GPU , 2011, 2011 International Conference on Parallel Architectures and Compilation Techniques.

[16]  Pradeep Dubey,et al.  Fast and Efficient Graph Traversal Algorithm for CPUs: Maximizing Single-Node Efficiency , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.

[17]  Andrew S. Grimshaw,et al.  Scalable GPU graph traversal , 2012, PPoPP '12.

[18]  Guy E. Blelloch,et al.  GraphChi: Large-Scale Graph Computation on Just a PC , 2012, OSDI.

[19]  Joseph Gonzalez,et al.  PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs , 2012, OSDI.

[20]  Guy E. Blelloch,et al.  Ligra: a lightweight graph processing framework for shared memory , 2013, PPoPP '13.

[21]  Keshav Pingali,et al.  A lightweight infrastructure for graph analytics , 2013, SOSP.

[22]  Michela Becchi,et al.  Deploying Graph Algorithms on GPUs: An Adaptive Solution , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.

[23]  Jean-Philippe Martin,et al.  Dandelion: a compiler and runtime for heterogeneous systems , 2013, SOSP.

[24]  David A. Wood,et al.  Heterogeneous system coherence for integrated CPU-GPU systems , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[25]  Kevin Skadron,et al.  Pannotia: Understanding irregular GPGPU graph applications , 2013, 2013 IEEE International Symposium on Workload Characterization (IISWC).