Paper Information - Scalable Data-Privatization Threading for Hybrid MPI/OpenMP Parallelization of Molecular Dynamics


The calculation of the Coulomb potential in the molecular dynamics code ddcMD has been parallelized using a hybrid MPI/OpenMP scheme. The explicit pair kernel of the particle-particle/particle-mesh algorithm is multithreaded using OpenMP, while communication between multicore nodes is handled by MPI. We have designed a load-balancing spanning forest (LBSF) partitioning algorithm, which combines (1) fine-grain dynamic load balancing and (2) minimal-memory-footprint data privatization via nucleation-growth allocation. This algorithm reduces the memory requirement for thread-private data from O(np) to O(n + p^{1/3} n^{2/3}), which amounts to a 75% memory saving for p = 16 threads working on n = 8,192 particles, while keeping the average thread-level load imbalance below 5%. The strong-scaling speedup of the kernel is 14.4 with 16-way threading on a four quad-core AMD Opteron node. In addition, our MPI/OpenMP code shows 2.58× and 2.16× speedups over the MPI-only implementation for systems of 0.84 and 1.68 million particles, respectively, on 32,768 cores of BlueGene/P.
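To get a feel for the two memory-scaling expressions quoted in the abstract, the following minimal sketch evaluates them for the stated values p = 16 and n = 8,192. It is illustrative arithmetic only: the function names are hypothetical, and the constants hidden by the O-notation are assumed to be 1, which the abstract does not state. Under that assumption the numbers will not reproduce the reported 75% saving, which comes from the actual implementation.

```python
# Illustrative arithmetic only. The function names are hypothetical and the
# prefactors hidden by the O-notation are ASSUMED to be 1 (not given in the
# abstract), so the result is a scaling comparison, not the measured saving.

def full_privatization_size(n: int, p: int) -> int:
    """Leading term of O(np): every one of p threads holds a private copy
    of the data for all n particles."""
    return n * p


def reduced_privatization_size(n: int, p: int) -> float:
    """Leading terms of O(n + p^(1/3) n^(2/3)) with unit constants."""
    return n + p ** (1.0 / 3.0) * n ** (2.0 / 3.0)


if __name__ == "__main__":
    n, p = 8192, 16  # values quoted in the abstract
    full = full_privatization_size(n, p)
    reduced = reduced_privatization_size(n, p)
    print(f"O(np) term:                   {full}")
    print(f"O(n + p^(1/3) n^(2/3)) terms: {reduced:.1f}")
    print(f"ratio (unit constants):       {reduced / full:.3f}")
```

For these particular values the surface-like term p^{1/3} n^{2/3} equals 2^{4/3} · 2^{26/3} = 2^{10} = 1,024, so it is small compared with n itself. This shows why the reduced scheme grows only weakly with the thread count, whereas full privatization grows linearly in p.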
