Developing Performance-Portable Molecular Dynamics Kernels in OpenCL

This paper investigates the development of a molecular dynamics code that is highly portable between architectures. Using OpenCL, we develop an implementation of Sandia's miniMD benchmark that achieves good levels of performance across a wide range of hardware: CPUs, discrete GPUs and integrated GPUs. We demonstrate that the performance bottlenecks of miniMD's short-range force calculation kernel are the same across these architectures, and detail a number of platform-agnostic optimisations that improve its performance by at least 2x on all hardware considered. Our complete code is shown to be 1.7x faster than the original miniMD, and at most 2x slower than implementations individually hand-tuned for a specific architecture.
