Temporal Vectorization for Stencils

Stencil computations are a very common class of nested loops in scientific and engineering applications. Exploiting the vector units of modern CPUs is crucial to achieving peak performance. Previous vectorization approaches typically operate on the data space, in particular along the innermost unit-stride loop. This leads to the well-known data alignment conflict problem: vector loads overlap because consecutive stencil computations share data. This paper proposes a novel temporal vectorization scheme for stencils. It vectorizes the stencil computation in the iteration space and assembles points with different time coordinates into one vector. Temporal vectorization requires only a small, fixed number of vector reorganizations, independent of the vector length, stencil order, and dimension. Furthermore, it is also applicable to Gauss-Seidel stencils, whose vectorization has not been well studied. The effectiveness of temporal vectorization is demonstrated on various Jacobi and Gauss-Seidel stencils.
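
To make the idea of assembling points with different time coordinates in one vector concrete, the following is a minimal sketch for a 1D 3-point Jacobi stencil with fixed boundary values. It is not the paper's implementation: the function names, the `stride` parameter, and the full space-time array `grid` are assumptions made for illustration, and NumPy index arrays stand in for SIMD lanes. Lane k updates time level k+1 at position p - k*stride, so each loop iteration advances several time levels at once.

```python
import numpy as np

def jacobi_reference(u0, steps):
    """Plain time-stepped 3-point Jacobi; the two end points stay fixed."""
    u = u0.copy()
    for _ in range(steps):
        nxt = u.copy()
        nxt[1:-1] = (u[:-2] + u[1:-1] + u[2:]) / 3.0
        u = nxt
    return u

def jacobi_temporal(u0, vl=4, stride=2):
    """Advance vl time steps with one 'vector' of vl lanes per iteration.

    Hypothetical illustration: lane k computes time level k+1 at
    position p - k*stride, so the lanes differ in their time coordinate.
    """
    n = u0.size
    # grid[t] stores time level t. Interior values of levels >= 1 start
    # as NaN so that any read of a not-yet-computed point is detectable.
    grid = np.full((vl + 1, n), np.nan)
    grid[0] = u0
    grid[:, 0] = u0[0]     # fixed left boundary at every time level
    grid[:, -1] = u0[-1]   # fixed right boundary at every time level

    lanes = np.arange(vl)
    for p in range(1, n - 1 + (vl - 1) * stride):
        x = p - lanes * stride            # one spatial position per lane
        ok = (x >= 1) & (x <= n - 2)      # mask lanes outside the interior
        t, xi = lanes[ok], x[ok]
        # All lanes read values produced in earlier iterations only.
        grid[t + 1, xi] = (grid[t, xi - 1] + grid[t, xi] + grid[t, xi + 1]) / 3.0
    return grid[vl]

rng = np.random.default_rng(0)
u0 = rng.random(32)
result = jacobi_temporal(u0, vl=4, stride=2)
expected = jacobi_reference(u0, steps=4)
print(np.allclose(result, expected))
```

In this sketch the stride must be at least 2 for the 3-point stencil: with a stride of 1, lane k would need a value that lane k-1 produces in the same iteration. A real SIMD version would keep the lanes in registers rather than in a full space-time array, which this sketch does not attempt to model.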
