Using Arm’s scalable vector extension on stencil codes

Data-level parallelism is frequently ignored or underutilized. Achieved through vector/SIMD capabilities, it can provide substantial performance improvements on top of widely used techniques such as thread-level parallelism. However, manual vectorization is a tedious and costly process that must be repeated for each specific instruction set or register size. In addition, automatic compiler vectorization is sensitive to code complexity and is usually limited by data and control dependencies. To address some of these issues, Arm recently released a new vector ISA, the Scalable Vector Extension (SVE), which is vector-length agnostic (VLA). VLA enables the generation of binaries that run correctly regardless of the physical vector register length. In this paper, we leverage the main characteristics of SVE to implement and optimize stencil computations, which are ubiquitous in scientific computing. We show that SVE enables easy deployment of textbook optimizations such as loop unrolling, loop fusion, load trading, and data reuse. Our detailed simulations using vector lengths ranging from 128 to 2048 bits show that these optimizations can improve performance over straightforward vectorized code by up to 1.57×. In addition, we show that certain optimizations can hurt performance due to reduced arithmetic intensity and instruction overheads, and we provide insights useful for compiler optimizers.
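To illustrate the vector-length agnostic style the abstract refers to, below is a minimal sketch of a VLA stencil loop written with Arm ACLE SVE intrinsics. The 1-D 3-point kernel, the function name stencil_1d_sve, and the coefficients are illustrative assumptions, not the benchmark codes evaluated in the paper; the point is that the loop advances by svcntd() elements and uses a predicate for the tail, so the same binary runs on any SVE register width.

    #include <arm_sve.h>
    #include <stdint.h>

    /* Hypothetical 1-D 3-point Jacobi-style stencil (not the paper's kernels):
     *   out[i] = c0*in[i] + c1*(in[i-1] + in[i+1]),  1 <= i < n-1.
     * Vector-length agnostic: the trip increment is svcntd() (number of
     * 64-bit lanes in a vector), and svwhilelt builds a predicate that
     * masks off inactive lanes in the final, partial iteration. */
    void stencil_1d_sve(const double *in, double *out, int64_t n,
                        double c0, double c1)
    {
        for (int64_t i = 1; i < n - 1; i += (int64_t)svcntd()) {
            svbool_t pg = svwhilelt_b64_s64(i, n - 1);   /* active lanes   */
            svfloat64_t center = svld1_f64(pg, &in[i]);
            svfloat64_t left   = svld1_f64(pg, &in[i - 1]);
            svfloat64_t right  = svld1_f64(pg, &in[i + 1]);
            svfloat64_t acc    = svmul_n_f64_x(pg, center, c0);
            /* acc = acc + (left + right) * c1 */
            acc = svmla_n_f64_x(pg, acc, svadd_f64_x(pg, left, right), c1);
            svst1_f64(pg, &out[i], acc);
        }
    }

Because no loop bound or pointer arithmetic depends on a compile-time vector width, optimizations such as unrolling by a fixed number of vector iterations or fusing neighboring stencil updates can be applied to this pattern without rewriting it per register size.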
