Vectorization of High-performance Scientific Calculations Using AVX-512 Instruction Set

Modern computational codes used in supercomputing place heavy demands on computing resources. Using them effectively requires parallelization at all levels, from multiprocess and multithreaded programming down to vectorization. The AVX-512 instruction set, first introduced in Intel Xeon Phi Knights Landing and Intel Xeon Skylake microprocessors, opens up broad possibilities for vectorizing code and can speed up application execution severalfold. This article discusses some aspects of applying vectorization to several kinds of program code found in high-performance scientific computing.
