Optimizing Overlapped Memory Accesses in User-directed Vectorization

Current processors incorporate wide and powerful vector units whose optimal exploitation is crucial to reach peak performance. However, present autovectorizing compilers fall short of that goal. Exploiting some vector instructions requires aggressive approaches that are not affordable in production compilers. Thus, advanced programmers pursuing the best performance from their applications are compelled to manually vectorize them using low-level SIMD intrinsics. We propose a user-directed code optimization that targets overlapped vector loads, i.e., vector loads that read scalar elements redundantly from memory. Instead, our optimization loads these elements once and combines them using advanced register-to-register vector instructions. The resulting code is potentially more efficient, and it uses advanced vector instructions that compilers do not widely exploit automatically. We also extend the OpenMP* SIMD directives with a new clause called overlap that allows users to easily enable and tune this optimization on demand. We implement our proposal for the Intel® Xeon Phi™ coprocessor. Our evaluation shows speed-ups of up to 29% on five highly optimized stencil kernels and workloads from real-world applications. Results also demonstrate how important user hints are to maximize performance.
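To make the notion of overlapped vector loads concrete, the following is a minimal sketch in plain C, not the paper's implementation. It assumes a hypothetical vector length `VL` of 4 elements and models vector registers as small arrays; the names `stencil_naive`, `stencil_overlap`, and `load_block` are illustrative only. In the naive 3-point stencil, a vectorized loop would issue three vector loads per iteration (at offsets -1, 0, and +1) that mostly read the same elements. The second version reads each block from memory once and builds the shifted operands from values already held in the modeled registers, which is the role the register-to-register instructions play on real hardware.

```c
#include <assert.h>
#include <string.h>

#define VL 4 /* hypothetical vector length in elements (assumption) */

/* Baseline 3-point stencil. Vectorized directly, each iteration needs three
   vector loads (in[i-1..], in[i..], in[i+1..]) that overlap in memory. */
static void stencil_naive(const float *in, float *out, int n) {
    for (int i = 1; i < n - 1; i++)
        out[i] = in[i - 1] + in[i] + in[i + 1];
}

/* Models one vector load of VL elements starting at index b.
   Lanes that fall outside [0, n) are filled with zero and never used. */
static void load_block(const float *in, int b, int n, float *reg) {
    for (int j = 0; j < VL; j++)
        reg[j] = (b + j >= 0 && b + j < n) ? in[b + j] : 0.0f;
}

/* Same stencil, but every input element is read from memory exactly once.
   The shifted operands are assembled from the prev/cur/next "registers". */
static void stencil_overlap(const float *in, float *out, int n) {
    assert(n >= 0);
    float prev[VL] = {0}, cur[VL], next[VL];
    load_block(in, 0, n, cur);
    for (int b = 0; b < n; b += VL) {
        load_block(in, b + VL, n, next); /* only memory read in this iteration */
        for (int j = 0; j < VL; j++) {
            int i = b + j;
            if (i < 1 || i >= n - 1)
                continue; /* boundary elements are left untouched */
            /* Register-to-register combination: lane j-1 and lane j+1,
               borrowing one lane from the neighbouring register at the edges. */
            float left = (j == 0) ? prev[VL - 1] : cur[j - 1];
            float right = (j == VL - 1) ? next[0] : cur[j + 1];
            out[i] = left + cur[j] + right;
        }
        memcpy(prev, cur, sizeof cur);  /* rotate registers for the next block */
        memcpy(cur, next, sizeof next);
    }
}
```

Both functions produce the same output for interior elements; the difference lies only in how many times each input element is fetched from memory. How much of the reuse window to keep in registers is the kind of tuning decision the abstract attributes to the user through the overlap clause, whose exact syntax is not given here.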
