Exploring SIMD for Molecular Dynamics, Using Intel

We analyse gather-scatter performance bottlenecks in molecular dynamics codes and the challenges that they pose for obtaining benefits from SIMD execution. This analysis informs a number of novel code-level and algorithmic improvements to Sandia’s miniMD benchmark, which we demonstrate using three SIMD widths (128-, 256- and 512-bit). The applicability of these optimisations to wider SIMD is discussed, and we show that the conventional approach of exposing more parallelism through redundant computation is not necessarily best. In single precision, our optimised implementation is up to 5x faster than the original scalar code running on Intel® Xeon® processors with 256-bit SIMD, and adding a single Intel® Xeon Phi™ coprocessor provides up to an additional 2x performance increase. These results demonstrate: (i) the importance of effective SIMD utilisation for molecular dynamics codes on current and future hardware; and (ii) the considerable performance increase afforded by the use of Intel® Xeon Phi™ coprocessors for highly parallel workloads.

Keywords: scientific computing; accelerator architectures; parallel programming; performance analysis; high performance computing
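To make the gather-scatter pattern and the "redundant computation" trade-off concrete, here is a minimal sketch of a short-range Lennard-Jones force loop over neighbour lists. It is written in plain Python for readability and is not miniMD's code or the paper's optimised kernel. All function names (`build_neighbor_lists`, `lj_scale`, `forces_half_list`, `forces_full_list`) are hypothetical, and reduced units (epsilon = sigma = 1) are assumed. The half-list variant computes each pair once but must scatter an update to the neighbour's force. The full-list variant computes each pair twice and needs no scatter.

```python
# Illustrative sketch only; names are hypothetical, not taken from miniMD.

def build_neighbor_lists(pos, cutoff_sq):
    """Brute-force O(N^2) construction of half and full neighbour lists."""
    n = len(pos)
    half = [[] for _ in range(n)]  # each pair stored once (j > i)
    full = [[] for _ in range(n)]  # each pair stored for both atoms
    for i in range(n):
        for j in range(i + 1, n):
            r2 = sum((a - b) ** 2 for a, b in zip(pos[i], pos[j]))
            if r2 < cutoff_sq:
                half[i].append(j)
                full[i].append(j)
                full[j].append(i)
    return half, full


def lj_scale(r2, cutoff_sq):
    """Lennard-Jones force magnitude divided by r (epsilon = sigma = 1)."""
    if r2 >= cutoff_sq or r2 == 0.0:
        return 0.0
    inv_r2 = 1.0 / r2
    inv_r6 = inv_r2 * inv_r2 * inv_r2
    return 48.0 * inv_r6 * (inv_r6 - 0.5) * inv_r2


def forces_half_list(pos, half, cutoff_sq):
    """Each pair computed once; the neighbour's force is updated by a scatter."""
    f = [[0.0, 0.0, 0.0] for _ in pos]
    for i, neighbors in enumerate(half):
        for j in neighbors:
            # Gather: pos[j] is an indexed (non-contiguous) load.
            d = [pos[i][k] - pos[j][k] for k in range(3)]
            s = lj_scale(d[0] * d[0] + d[1] * d[1] + d[2] * d[2], cutoff_sq)
            for k in range(3):
                f[i][k] += s * d[k]
                # Scatter: indexed store. In a SIMD loop over j, different
                # lanes may not safely update the same f[j] at once.
                f[j][k] -= s * d[k]
    return f


def forces_full_list(pos, full, cutoff_sq):
    """Each pair computed twice (redundantly), but only f[i] is written."""
    f = [[0.0, 0.0, 0.0] for _ in pos]
    for i, neighbors in enumerate(full):
        for j in neighbors:
            # Gather is still required; the scatter is gone.
            d = [pos[i][k] - pos[j][k] for k in range(3)]
            s = lj_scale(d[0] * d[0] + d[1] * d[1] + d[2] * d[2], cutoff_sq)
            for k in range(3):
                f[i][k] += s * d[k]
    return f
```

Both variants should produce the same forces up to floating-point rounding. The difference is in the memory access pattern and the amount of arithmetic, which is the trade-off the abstract refers to when it says that redundant computation is not necessarily the best way to expose parallelism.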
