Exploiting hierarchical parallelisms for molecular dynamics simulation on multicore clusters

We have developed a scalable hierarchical parallelization scheme for molecular dynamics (MD) simulation on multicore clusters. The scheme exploits multilevel parallelism by combining: (1) internode parallelism using spatial decomposition via message passing; (2) intercore parallelism using cellular decomposition via multithreading with a master/worker model; and (3) data-level optimization via single-instruction multiple-data (SIMD) parallelism with various code transformation techniques. By using a hierarchy of parallelisms, the scheme exposes very high concurrency and data locality, thereby achieving: (1) an internode weak-scaling parallel efficiency of 0.985 on 106,496 BlueGene/L nodes (0.975 on 32,768 BlueGene/P nodes) and an internode strong-scaling parallel efficiency of 0.90 on 8,192 BlueGene/L nodes; (2) an intercore multithread parallel efficiency of 0.65 for eight threads on a dual quad-core Xeon platform; and (3) a SIMD speedup of around 2 for problem sizes ranging from 3,072 to 98,304 atoms. Furthermore, the effect of memory-access penalty on SIMD performance is analyzed, and an application-based SIMD analysis scheme is proposed to help programmers determine whether their applications are amenable to SIMDization.
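To make the efficiency figures above easier to interpret, the following is a minimal sketch of the conventional textbook definitions of strong- and weak-scaling parallel efficiency. The abstract does not state the exact definitions used, so these formulas are an assumption, and the function names (`strong_scaling_efficiency`, `weak_scaling_efficiency`, `implied_speedup`) are hypothetical and chosen only for illustration.

```python
def strong_scaling_efficiency(t_ref, p_ref, t_p, p):
    """Efficiency for a fixed total problem size (assumed definition).

    t_ref: wall-clock time on the reference run with p_ref processors
    t_p:   wall-clock time on p processors
    Ideal run time shrinks by the factor p_ref / p.
    """
    speedup = t_ref / t_p
    return speedup / (p / p_ref)


def weak_scaling_efficiency(t_ref, t_p):
    """Efficiency for a fixed amount of work per processor (assumed definition).

    The total problem size grows with the processor count, so the ideal
    run time stays constant and efficiency is the ratio of run times.
    """
    return t_ref / t_p


def implied_speedup(efficiency, p, p_ref=1):
    """Speedup over the reference run implied by a given efficiency."""
    return efficiency * p / p_ref


if __name__ == "__main__":
    # Speedup implied by the reported multithread efficiency of 0.65
    # on eight threads, assuming a single-thread reference run.
    print(implied_speedup(0.65, 8))
```

Under these assumed definitions, the reported multithread efficiency of 0.65 on eight threads corresponds to a speedup of roughly 5.2 over a single thread.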
