Analysis of scalable data-privatization threading algorithms for hybrid MPI/OpenMP parallelization of molecular dynamics

We propose and analyze threading algorithms for hybrid MPI/OpenMP parallelization of a molecular-dynamics simulation, which are scalable on large multicore clusters. Two data-privatization thread scheduling algorithms via nucleation-growth allocation are introduced: (1) compact-volume allocation scheduling (CVAS); and (2) breadth-first allocation scheduling (BFAS). The algorithms combine fine-grain dynamic load balancing and minimal memory-footprint data privatization threading. We show that the computational costs of CVAS and BFAS are bounded by $\Theta(n^{5/3} p^{-2/3})$ and $\Theta(n)$, respectively, for $p$ threads working on $n$ particles on a multicore compute node. Memory consumption per node of both algorithms scales as $O(n + n^{2/3} p^{1/3})$, but CVAS has smaller prefactors due to a geometric effect. Based on these analyses, we derive the selection criterion between the two algorithms in terms of the granularity, $n/p$. We observe that memory consumption is reduced by 75% for $p = 16$ and $n = 8{,}192$ compared to a naïve data privatization, while maintaining thread imbalance below 5%. We obtain a strong-scaling speedup of 14.4 with 16-way threading on a four quad-core AMD Opteron node. In addition, our MPI/OpenMP code achieves 2.58× and 2.16× speedups over the MPI-only implementation on 32,768 cores of BlueGene/P for 0.84 and 1.68 million particle systems, respectively.
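To illustrate why a selection criterion in terms of the granularity n/p follows from the two cost bounds, the sketch below evaluates both scaling laws and solves for the granularity at which they are equal. It is a minimal sketch, not the authors' derivation: the prefactors a, b, c1 and c2 and all function names are hypothetical, since the abstract gives only the asymptotic forms.

```python
# Scaling-law sketch based only on the asymptotic forms quoted in the abstract.
# The prefactors a, b, c1, c2 are hypothetical; the paper's actual constants
# are not given in the abstract.

def cvas_cost(n, p, a=1.0):
    """Model of the CVAS cost bound: a * n^(5/3) * p^(-2/3)."""
    return a * n ** (5.0 / 3.0) * p ** (-2.0 / 3.0)


def bfas_cost(n, p, b=1.0):
    """Model of the BFAS cost bound: b * n (independent of p in this model)."""
    return b * n


def crossover_granularity(a=1.0, b=1.0):
    """Granularity n/p at which the two cost models are equal.

    Setting a * n^(5/3) * p^(-2/3) = b * n gives a * (n/p)^(2/3) = b,
    so n/p = (b/a)^(3/2).
    """
    return (b / a) ** 1.5


def privatized_memory(n, p, c1=1.0, c2=1.0):
    """Model of per-node memory for both algorithms: c1*n + c2*n^(2/3)*p^(1/3)."""
    return c1 * n + c2 * n ** (2.0 / 3.0) * p ** (1.0 / 3.0)


if __name__ == "__main__":
    a, b = 1.0, 4.0  # hypothetical prefactors
    g = crossover_granularity(a, b)
    print(f"cost models cross at n/p = {g:g}")
    for n, p in [(64, 16), (128, 16), (8192, 16)]:
        print(n, p, cvas_cost(n, p, a), bfas_cost(n, p, b))
```

In this model the ratio of the CVAS cost to the BFAS cost is (a/b)(n/p)^(2/3), so it depends on n and p only through the granularity. This is consistent with the abstract's statement that the choice between the two algorithms is made in terms of n/p; which side of the crossover favors which algorithm in practice also depends on the memory prefactors, which the abstract does not quantify.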
