Rapid Exploration of Optimization Strategies on Advanced Architectures using TestSNAP and LAMMPS

The exascale race is at an end with the announcement of the Aurora and Frontier machines. This next generation of supercomputers utilize diverse hardware architectures to achieve their compute performance, providing an added onus on the performance portability of applications. An expanding fragmentation of programming models would provide a compounding optimization challenge were it not for the evolution of performance-portable frameworks, providing unified models for mapping abstract hierarchies of parallelism to diverse architectures. A solution to this challenge is the evolution of performance-portable frameworks, providing unified models for mapping abstract hierarchies of parallelism to diverse architectures. Kokkos is one such performance portable programming model for C++ applications, providing back-end implementations for each major HPC platform. Even with a performance portable framework, restructuring algorithms to expose higher degrees of parallelism is non-trivial. The Spectral Neighbor Analysis Potential (SNAP) is a machine-learned inter-atomic potential utilized in cutting-edge molecular dynamics simulations. Previous implementations of the SNAP calculation showed a downward trend in their performance relative to peak on newer-generation CPUs and low performance on GPUs. In this paper we describe the restructuring and optimization of SNAP as implemented in the Kokkos CUDA backend of the LAMMPS molecular dynamics package, benchmarked on NVIDIA GPUs. We identify novel patterns of hierarchical parallelism, facilitating a minimization of memory access overheads and pushing the implementation into a compute-saturated regime. Our implementation via Kokkos enables recompile-and-run efficiency on upcoming architectures. We find a $\sim$22x time-to-solution improvement relative to an existing implementation as measured on an NVIDIA Tesla V100-16GB for an important benchmark.

[1]  Simon D. Hammond,et al.  SNAP: Strong Scaling High Fidelity Molecular Dynamics Simulations on Leadership-Class Computing Platforms , 2014, ISC.

[2]  Michele Parrinello,et al.  Generalized neural-network representation of high-dimensional potential-energy surfaces. , 2007, Physical review letters.

[3]  Christian Trott,et al.  Spectral neighbor analysis method for automated generation of quantum-accurate interatomic potentials , 2014, J. Comput. Phys..

[4]  Steve Plimpton,et al.  Fast parallel algorithms for short-range molecular dynamics , 1993 .

[5]  R. Kondor,et al.  On representing chemical environments , 2012, 1209.3140.

[6]  Olga Pearce,et al.  RAJA: Portable Performance for Large-Scale Scientific Applications , 2019, 2019 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC).

[7]  Daniel Sunderland,et al.  Kokkos: Enabling manycore performance portability through polymorphic memory access patterns , 2014, J. Parallel Distributed Comput..

[8]  Ludek Matyska,et al.  Optimizing CUDA code by kernel fusion: application on BLAS , 2013, The Journal of Supercomputing.

[9]  R. Kondor,et al.  Gaussian approximation potentials: the accuracy of quantum mechanics, without the electrons. , 2009, Physical review letters.

[10]  Markus Bachmayr,et al.  Approximation of Potential Energy Surfaces with Spherical Harmonics , 2019, ArXiv.

[11]  Ralf Drautz,et al.  Atomic cluster expansion for accurate and transferable interatomic potentials , 2019, Physical Review B.

[12]  K. Müller,et al.  Fast and accurate modeling of molecular atomization energies with machine learning. , 2011, Physical review letters.