Scalable Data-Privatization Threading for Hybrid MPI/OpenMP Parallelization of Molecular Dynamics

Calculation of the Coulomb potential in the molecular dynamics code ddcMD has been parallelized using a hybrid MPI/OpenMP scheme. The explicit pair kernel of the particle-particle/particle-mesh algorithm is multithreaded with OpenMP, while communication between multicore nodes is handled by MPI. We have designed a load-balancing spanning forest (LBSF) partitioning algorithm, which combines: 1) fine-grain dynamic load balancing; and 2) minimal-memory-footprint data privatization via nucleation-growth allocation. This algorithm reduces the memory requirement for thread-private data from O(np) to O(n + p^{1/3} n^{2/3}), which amounts to a 75% memory saving for p = 16 threads working on n = 8,192 particles, while keeping the average thread-level load imbalance below 5%. Strong-scaling speedup for the kernel is 14.4 with 16-way threading on a node with four quad-core AMD Opteron processors. In addition, our MPI/OpenMP code shows 2.58× and 2.16× speedups over the MPI-only implementation for systems of 0.84 and 1.68 million particles, respectively, on 32,768 cores of BlueGene/P.
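To give a feel for the two scaling laws quoted above, the following sketch evaluates O(np) and O(n + p^{1/3} n^{2/3}) with every prefactor set to 1. The function names and the unit prefactors are assumptions made for illustration only; they are not taken from ddcMD. The 75% saving reported in the abstract depends on constant factors that the asymptotic expressions omit, so the sketch does not attempt to reproduce that number. It only shows how the gap between the two expressions grows with the thread count p.

```python
def naive_private_words(n: int, p: int) -> float:
    """Thread-private storage when each of p threads keeps a full copy
    of an n-element array: the O(n p) scaling, with unit prefactor (assumed)."""
    return float(n * p)


def reduced_private_words(n: int, p: int) -> float:
    """Storage following the O(n + p^(1/3) n^(2/3)) scaling,
    again with unit prefactors (assumed, for illustration only)."""
    return n + p ** (1.0 / 3.0) * n ** (2.0 / 3.0)


if __name__ == "__main__":
    n = 8192  # particle count used in the abstract
    for p in (1, 2, 4, 8, 16):
        naive = naive_private_words(n, p)
        reduced = reduced_private_words(n, p)
        print(f"p={p:2d}  n*p={naive:9.0f}  "
              f"n+p^(1/3)n^(2/3)={reduced:9.1f}  ratio={naive / reduced:5.2f}")
```

With unit prefactors the first expression grows linearly in p while the second grows only as p^{1/3} in its surface-like term, which is the qualitative behavior the abstract describes.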
