Evaluation of Parallel Communication Models in Nekbone, a Nek5000 Mini-Application

Nekbone is a proxy application of Nek5000, a scalable Computational Fluid Dynamics (CFD) code used for modelling incompressible flows. The Nekbone mini-application is used by several international co-design centers to explore new concepts in computer science and to evaluate their performance. We present the design and implementation of a new communication kernel in the Nekbone mini-application with the goal of studying the performance of different parallel communication models. First, we developed a new MPI blocking communication kernel that solves Nekbone problems on a three-dimensional Cartesian mesh and process topology. The new MPI implementation delivers a 13% performance improvement over the original implementation and consists of approximately 500 lines of code instead of the original 7,000, which makes it easier to experiment with new approaches to parallel communication in Nekbone. Second, we replaced the blocking MPI communication in the new kernel with non-blocking MPI communication. Third, we developed a new Partitioned Global Address Space (PGAS) communication kernel based on the GPI-2 library. This approach reduces synchronization among neighboring processes: in our tests on 8,192 processes, the GPI-2 communication kernel is on average 3% faster than the new non-blocking MPI kernel. In addition, we used OpenMP in all versions of the new communication kernel. Finally, we outline future steps for using the new communication kernel in the parent application, Nek5000.
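
To make the comparison concrete, the sketch below (in C, not taken from the Nekbone or Nek5000 sources) shows the kind of non-blocking face exchange on a three-dimensional Cartesian process topology that the new MPI kernel performs; the blocking variant would issue an MPI_Sendrecv per direction instead of posting MPI_Isend/MPI_Irecv pairs and overlapping local work before MPI_Waitall. Face sizes, buffer names, and tags are illustrative assumptions.

    #include <mpi.h>
    #include <stdlib.h>

    /* Minimal sketch: exchange one face with each of the six neighbors on a
     * 3D Cartesian process grid using non-blocking point-to-point calls.
     * Non-periodic boundaries yield MPI_PROC_NULL neighbors, which MPI
     * treats as no-ops. */
    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int dims[3] = {0, 0, 0}, periods[3] = {0, 0, 0};
        int nprocs;
        MPI_Comm cart;

        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        MPI_Dims_create(nprocs, 3, dims);            /* factor nprocs into a 3D grid */
        MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, &cart);

        const int n = 1024;                          /* illustrative face size */
        double *sendbuf[6], *recvbuf[6];
        for (int f = 0; f < 6; ++f) {
            sendbuf[f] = calloc(n, sizeof(double));
            recvbuf[f] = calloc(n, sizeof(double));
        }

        /* Post all receives and sends for the six face neighbors, then wait. */
        MPI_Request req[12];
        int nreq = 0;
        for (int dim = 0; dim < 3; ++dim) {
            int lo, hi;
            MPI_Cart_shift(cart, dim, 1, &lo, &hi);
            MPI_Irecv(recvbuf[2*dim],   n, MPI_DOUBLE, lo, 0, cart, &req[nreq++]);
            MPI_Irecv(recvbuf[2*dim+1], n, MPI_DOUBLE, hi, 1, cart, &req[nreq++]);
            MPI_Isend(sendbuf[2*dim],   n, MPI_DOUBLE, hi, 0, cart, &req[nreq++]);
            MPI_Isend(sendbuf[2*dim+1], n, MPI_DOUBLE, lo, 1, cart, &req[nreq++]);
        }

        /* ... interior computation could be overlapped here ... */

        MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE);

        for (int f = 0; f < 6; ++f) { free(sendbuf[f]); free(recvbuf[f]); }
        MPI_Comm_free(&cart);
        MPI_Finalize();
        return 0;
    }

The GPI-2 kernel replaces the matched send/receive pairs with one-sided notified writes, so a process waits only for the notifications of the neighbors that write into its segment rather than for two-sided message completion. A minimal GASPI sketch of that pattern, reduced to a single ring neighbor instead of the full 3D stencil and with an assumed segment layout and size, might look as follows (error checking omitted):

    #include <GASPI.h>

    int main(void)
    {
        gaspi_proc_init(GASPI_BLOCK);

        gaspi_rank_t rank, nprocs;
        gaspi_proc_rank(&rank);
        gaspi_proc_num(&nprocs);

        /* One segment per process: outgoing face at offset 0, incoming face
         * at offset n (this layout is an assumption for the sketch). */
        const gaspi_size_t n = 1024 * sizeof(double);
        gaspi_segment_create(0, 2 * n, GASPI_GROUP_ALL, GASPI_BLOCK,
                             GASPI_MEM_INITIALIZED);

        gaspi_rank_t right = (gaspi_rank_t)((rank + 1) % nprocs);
        gaspi_rank_t left  = (gaspi_rank_t)((rank + nprocs - 1) % nprocs);

        /* One-sided put of the local face into the right neighbor's segment,
         * carrying a notification that the data has arrived. */
        gaspi_write_notify(0, 0,        /* local segment id, offset          */
                           right,       /* target rank                       */
                           0, n,        /* remote segment id, offset         */
                           n,           /* size in bytes                     */
                           rank,        /* notification id: writer's rank    */
                           1,           /* notification value (must be > 0)  */
                           0,           /* queue                             */
                           GASPI_BLOCK);

        /* Wait only for the left neighbor's notification, not for a barrier. */
        gaspi_notification_id_t got;
        gaspi_notification_t val;
        gaspi_notify_waitsome(0, left, 1, &got, GASPI_BLOCK);
        gaspi_notify_reset(0, got, &val);

        gaspi_wait(0, GASPI_BLOCK);     /* drain the local queue before exit */
        gaspi_proc_term(GASPI_BLOCK);
        return 0;
    }

In both sketches, OpenMP threading of the local pack/unpack and compute loops is orthogonal to the communication model and is therefore omitted.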
