A GPU implementation of inclusion-based points-to analysis

Graphics Processing Units (GPUs) have emerged as powerful accelerators for many regular algorithms that operate on dense arrays and matrices. In contrast, we know relatively little about using GPUs to accelerate highly irregular algorithms that operate on pointer-based data structures such as graphs. For the most part, research has focused on GPU implementations of graph analysis algorithms that do not modify the structure of the graph, such as algorithms for breadth-first search and strongly-connected components. In this paper, we describe a high-performance GPU implementation of an important graph algorithm used in compilers such as gcc and LLVM: Andersen-style inclusion-based points-to analysis. This algorithm is challenging to parallelize effectively on GPUs because it makes extensive modifications to the structure of the underlying graph and performs relatively little computation. In spite of this, our program, when executed on a 14 Streaming Multiprocessor GPU, achieves an average speedup of 7x compared to a sequential CPU implementation and outperforms a parallel implementation of the same algorithm running on 16 CPU cores. Our implementation provides general insights into how to produce high-performance GPU implementations of graph algorithms, and it highlights key differences between optimizing parallel programs for multicore CPUs and for GPUs.

[1]  Alexander Aiken,et al.  Partial online cycle elimination in inclusion constraint graphs , 1998, PLDI.

[2]  Ondrej Lhoták,et al.  Points-to analysis using BDDs , 2003, PLDI '03.

[3]  Olivier Tardieu,et al.  Ultra-fast aliasing analysis using CLA: a million lines of C code in a second , 2001, PLDI '01.

[4]  Keshav Pingali,et al.  An Efficient CUDA Implementation of the Tree-Based Barnes Hut n-Body Algorithm , 2011 .

[5]  Pradeep Dubey,et al.  Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU , 2010, ISCA.

[6]  Lars Ole Andersen,et al.  Program Analysis and Specialization for the C Programming Language , 2005 .

[7]  Ulrik Brandes,et al.  Network Analysis: Methodological Foundations , 2010 .

[8]  Andrew S. Grimshaw,et al.  Scalable GPU graph traversal , 2012, PPoPP '12.

[9]  Keshav Pingali,et al.  Optimistic parallelism requires abstractions , 2007, PLDI '07.

[10]  Lubos Brim,et al.  Computing Strongly Connected Components in Parallel on CUDA , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.

[11]  Ondrej Lhoták,et al.  Scaling Java Points-to Analysis Using SPARK , 2003, CC.

[12]  Keshav Pingali,et al.  Synthesizing concurrent schedulers for irregular algorithms , 2011, ASPLOS XVI.

[13]  Edmond Chow,et al.  A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L , 2005, ACM/IEEE SC 2005 Conference (SC'05).

[14]  Keshav Pingali,et al.  Parallel inclusion-based points-to analysis , 2010, OOPSLA.

[15]  Bjarne Steensgaard,et al.  Points-to analysis in almost linear time , 1996, POPL '96.

[16]  L. Paul Chew,et al.  Guaranteed-quality mesh generation for curved surfaces , 1993, SCG '93.

[17]  Keshav Pingali,et al.  The tao of parallelism in algorithms , 2011, PLDI '11.

[18]  Kunle Olukotun,et al.  Accelerating CUDA graph algorithms at maximum warp , 2011, PPoPP '11.

[19]  Atanas Rountev,et al.  Off-line variable substitution for scaling points-to analysis , 2000, PLDI '00.

[20]  Fritz Henglein,et al.  Type inference and semi-unification , 1988, LISP and Functional Programming.

[21]  David A. Bader,et al.  Designing Multithreaded Algorithms for Breadth-First Search and st-connectivity on the Cray MTA-2 , 2006, 2006 International Conference on Parallel Processing (ICPP'06).

[22]  Ulrik Brandes,et al.  Network Analysis: Methodological Foundations (Lecture Notes in Computer Science) , 2005 .

[23]  Kevin Skadron,et al.  A performance study of general-purpose applications on graphics processors using CUDA , 2008, J. Parallel Distributed Comput..

[24]  Monica S. Lam,et al.  Cloning-based context-sensitive pointer alias analysis using binary decision diagrams , 2004, PLDI '04.

[25]  Matthew Might,et al.  EigenCFA: accelerating flow analysis with GPUs , 2011, POPL '11.

[26]  Ben Hardekopf,et al.  The ant and the grasshopper: fast and accurate pointer analysis for millions of lines of code , 2007, PLDI '07.

[27]  Randal E. Bryant,et al.  Graph-Based Algorithms for Boolean Function Manipulation , 1986, IEEE Transactions on Computers.

[28]  Kunle Olukotun,et al.  Efficient Parallel Graph Exploration on Multi-Core CPU and GPU , 2011, 2011 International Conference on Parallel Architectures and Compilation Techniques.

[29]  Song Huang,et al.  On the energy efficiency of graphics processing units for scientific computing , 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.

[30]  Thomas W. Reps,et al.  Program analysis via graph reachability , 1997, Inf. Softw. Technol..

[31]  Murat Efe Guney,et al.  On the limits of GPU acceleration , 2010 .

[32]  Michael Hind,et al.  Pointer analysis: haven't we solved this problem yet? , 2001, PASTE '01.

[33]  Martin D. F. Wong,et al.  An effective GPU implementation of breadth-first search , 2010, Design Automation Conference.

[34]  P. J. Narayanan,et al.  Accelerating Large Graph Algorithms on the GPU Using CUDA , 2007, HiPC.