Accelerating inclusion-based pointer analysis on heterogeneous CPU-GPU systems

This paper describes the first implementation of Andersen's inclusion-based pointer analysis for C programs on a heterogeneous CPU-GPU system, in which both the CPU and GPU cores are used. As an important graph algorithm, Andersen's analysis is difficult to parallelise because it makes extensive modifications to the structure of the underlying graph, in a way that is highly input-dependent and statically hard to analyse. Existing parallel solutions run on either the CPU or the GPU but not both, leaving the underlying computational resources underutilised and making the ratio of CPU-only to GPU-only speedups unpredictable from one program (i.e., graph) to another. We observe that a naive parallel implementation of Andersen's analysis on a CPU-GPU system suffers from poor performance due to workload imbalance. We introduce a solution centred on a new dynamic workload distribution scheme. The novelty lies in prioritising the distribution of different types of workloads, i.e., the graph-rewriting rules of Andersen's analysis, to the CPU or GPU according to how well each processing unit is suited to processing them. This scheme is effective when combined with synchronisation-free execution of tasks (i.e., graph-rewriting rules) and difference propagation of points-to information between the CPU and GPU. For the seven C benchmarks evaluated, our CPU-GPU solution outperforms, on average, (1) the CPU-only solution by 50.6%, (2) the GPU-only solution by 78.5%, and (3) an oracle solution that behaves as the faster of (1) and (2) on every benchmark by 34.6%.
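To make the graph-rewriting view concrete, the C++ sketch below gives a minimal, sequential formulation of Andersen's analysis with difference (delta) propagation: only newly discovered points-to facts are forwarded along copy edges, and load/store rules rewrite the constraint graph by adding new edges as the solution grows. The constraint kinds, class names, and worklist layout are illustrative assumptions only; this is not the paper's CPU-GPU implementation, which distributes these rules dynamically across both devices.

    // Minimal sequential sketch of Andersen's inclusion-based pointer analysis
    // with difference ("delta") propagation. Names and data layout are
    // illustrative assumptions, not the paper's heterogeneous implementation.
    #include <set>
    #include <vector>
    #include <queue>

    using Var = int;
    using PtsSet = std::set<Var>;

    enum class Kind { AddrOf, Copy, Load, Store };   // p=&a, p=q, p=*q, *p=q

    struct Constraint { Kind kind; Var dst, src; };

    struct Andersen {
        int n;
        std::vector<PtsSet> pts, delta;              // full sets and new facts
        std::vector<std::vector<Var>> copyEdges;     // src -> {dst}: pts(src) ⊆ pts(dst)
        std::vector<Constraint> loads, stores;       // complex (graph-rewriting) rules

        explicit Andersen(int numVars)
            : n(numVars), pts(n), delta(n), copyEdges(n) {}

        // Record a new points-to fact; return true if it was not already known.
        bool addPts(Var p, Var obj) {
            if (!pts[p].insert(obj).second) return false;
            delta[p].insert(obj);
            return true;
        }

        // Add a copy edge and eagerly propagate everything already known at src.
        bool addCopyEdge(Var src, Var dst) {
            copyEdges[src].push_back(dst);
            bool changed = false;
            for (Var o : pts[src]) changed |= addPts(dst, o);
            return changed;
        }

        void solve(const std::vector<Constraint>& cs) {
            std::queue<Var> work;
            for (const auto& c : cs) {
                switch (c.kind) {
                    case Kind::AddrOf: if (addPts(c.dst, c.src)) work.push(c.dst); break;
                    case Kind::Copy:   if (addCopyEdge(c.src, c.dst)) work.push(c.dst); break;
                    case Kind::Load:   loads.push_back(c);  break;   // dst = *src
                    case Kind::Store:  stores.push_back(c); break;   // *dst = src
                }
            }
            while (!work.empty()) {
                Var v = work.front(); work.pop();
                PtsSet d; d.swap(delta[v]);          // process only the new facts
                if (d.empty()) continue;
                // Graph rewriting: new targets of v introduce new copy edges
                // through the load/store rules that dereference v.
                for (const auto& c : loads)
                    if (c.src == v)
                        for (Var o : d)
                            if (addCopyEdge(o, c.dst)) work.push(c.dst);
                for (const auto& c : stores)
                    if (c.dst == v)
                        for (Var o : d)
                            if (addCopyEdge(c.src, o)) work.push(o);
                // Difference propagation: forward only d along existing copy edges.
                for (Var succ : copyEdges[v]) {
                    bool changed = false;
                    for (Var o : d) changed |= addPts(succ, o);
                    if (changed) work.push(succ);
                }
            }
        }
    };

A production solver would index the load/store constraints by variable rather than scanning them on every worklist pop, and would use sparse bit vectors instead of std::set; both simplifications are kept here only to keep the rule structure visible.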
