Using Arm Scalable Vector Extension to Optimize Open MPI

As the scale of high-performance computing (HPC) systems continues to grow, increasing levels of parallelism must be employed to achieve optimal performance. Because recent processors support wide vector extensions, vectorization has become much more important for exploiting the potential peak performance of the target architecture. Novel processor architectures, such as the Armv8-A architecture, introduce the Scalable Vector Extension (SVE), an optional, separate architectural extension with a new set of A64 instruction encodings that enables even greater parallelism. In this paper, we analyze the usage and performance of the instructions in the Arm SVE vector Instruction Set Architecture (ISA) and use those instructions to improve memcpy and various local reduction operations. Furthermore, we propose new strategies to improve the performance of MPI operations, including datatype packing/unpacking and MPI reduction. With these optimizations, we not only provide higher parallelism within a single node, but also achieve a more efficient communication scheme for message exchange. The resulting work has been implemented in the context of Open MPI, providing efficient and scalable SVE support and extending the possible uses of SVE to a wider range of programming and execution paradigms. The evaluation of the resulting software stack under different scenarios, with both a simulator and Fujitsu’s A64FX processor, demonstrates that the solution is both generic and efficient.
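
To make the idea of an SVE-based local reduction concrete, the following is a minimal sketch of a vector-length-agnostic element-wise sum written with the Arm C Language Extensions (ACLE) intrinsics. It is not the code from the paper or from Open MPI: the function name `local_sum_f32` and its signature are hypothetical, and the scalar fallback is added only so that the sketch also compiles on targets without SVE.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

/* Hypothetical helper (not an Open MPI API): inout[i] += in[i] for i in [0, count). */
void local_sum_f32(const float *in, float *inout, size_t count)
{
#if defined(__ARM_FEATURE_SVE)
    /* svcntw() is the number of 32-bit lanes per vector, known only at run
     * time, so the same binary works for any hardware vector length. */
    for (uint64_t i = 0; i < (uint64_t)count; i += svcntw()) {
        /* Predicate with lanes active while i + lane < count; this handles
         * the tail without a separate scalar remainder loop. */
        svbool_t pg = svwhilelt_b32(i, (uint64_t)count);
        svfloat32_t a = svld1_f32(pg, in + i);
        svfloat32_t b = svld1_f32(pg, inout + i);
        svst1_f32(pg, inout + i, svadd_f32_x(pg, a, b));
    }
#else
    /* Scalar fallback for targets without SVE (an assumption of this sketch). */
    for (size_t i = 0; i < count; i++) {
        inout[i] += in[i];
    }
#endif
}
```

Calling `local_sum_f32(in, inout, n)` accumulates `in` into `inout` in place, which mirrors the in-place convention of an MPI sum reduction on a local buffer. On an SVE target the sketch would typically be built with a flag such as `-march=armv8-a+sve`.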