Loop-Oriented Pointer Analysis for Automatic SIMD Vectorization

Compiler-based vectorization is a promising way to automatically generate code that makes efficient use of modern CPUs with SIMD extensions. The two main auto-vectorization techniques, superword-level parallelism vectorization (SLP) and loop-level vectorization (LLV), require precise dependence analysis on arrays and structs, both to vectorize isomorphic scalar instructions (in the case of SLP) and to reduce dynamic dependence checks at runtime (in the case of LLV). The alias analyses used in modern vectorizing compilers are either intra-procedural (without tracking inter-procedural data-flows) or inter-procedural (using field-insensitive models, which are too imprecise in handling arrays and structs). This article proposes an inter-procedural Loop-oriented Pointer Analysis for C, called Lpa, for analyzing arrays and structs to support aggressive SLP and LLV optimizations effectively. Unlike field-insensitive solutions that pre-allocate objects for each memory allocation site, our approach uses a lazy memory model to generate access-based location sets based on how structs and arrays are accessed. Lpa can precisely analyze arrays and nested aggregate structures to enable SIMD optimizations for large programs. By separating location set generation as an independent concern from the rest of the pointer analysis, Lpa is designed so that existing points-to resolution algorithms (e.g., flow-insensitive and flow-sensitive pointer analysis) can be reused easily. We have fully implemented Lpa in the LLVM compiler infrastructure (version 3.8.0). We evaluate Lpa by considering SLP and LLV, the two classic vectorization techniques, on a set of 20 C and Fortran CPU2000/2006 benchmarks. For SLP, Lpa outperforms LLVM’s BasicAA and ScevAA by discovering 139 and 273 more vectorizable basic blocks, respectively, with a best speedup of 2.95% for 173.applu. For LLV, LLVM introduces a total of 551 and 652 static bound checks under BasicAA and ScevAA, respectively. In contrast, Lpa reduces these static checks to 220, with an average of 15.7 checks per benchmark, with a best speedup of 7.23% for 177.mesa.
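
To make the LLV case concrete, the following C sketch shows the kind of situation the abstract describes. It is an illustration only and is not taken from the article: the struct `vec_pair`, the functions `scale_add` and `update`, and the helper `ranges_overlap` are all hypothetical names. The helper is a hand-written stand-in for a bound check that a loop vectorizer may insert when it cannot prove that two pointers refer to disjoint memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define N 8

/* Hypothetical aggregate: two arrays stored as fields of one struct. */
struct vec_pair {
    float x[N];
    float y[N];
};

/* Returns 1 if the byte ranges [p, p+len) and [q, q+len) overlap.
 * Assumption: this only approximates, by hand, the overlap test a
 * vectorizer might guard a vector loop with; it is not LLVM's code. */
static int ranges_overlap(const void *p, const void *q, size_t len)
{
    uintptr_t a = (uintptr_t)p, b = (uintptr_t)q;
    return a < b + len && b < a + len;
}

/* Seen in isolation, dst and src may alias, so a vectorized version of
 * this loop would normally need a check such as ranges_overlap(). */
static void scale_add(float *dst, const float *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = dst[i] + 2.0f * src[i];
}

/* The only caller passes two distinct fields of the same object. An
 * inter-procedural analysis that distinguishes s->x from s->y could
 * establish that the two ranges never overlap at this call site. */
static void update(struct vec_pair *s)
{
    assert(s != NULL);
    scale_add(s->x, s->y, N);
}
```

Under a field-insensitive model, `s->x` and `s->y` collapse into the single object `*s` and therefore appear to alias, which is why the check would stay in place. An analysis that separates the two fields, and tracks the pointers across the call, is what would allow such a check to be dropped.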
