Speeding up parallel GROMACS on high‐latency networks

We investigate the parallel scaling of the GROMACS molecular dynamics code on Ethernet Beowulf clusters and the prerequisites for decent scaling even on clusters with only limited bandwidth and high latency. GROMACS 3.3 scales well on supercomputers like the IBM p690 (Regatta) and on Linux clusters with a special interconnect like Myrinet or InfiniBand. Because of the high single-node performance of GROMACS, however, on the widely used switched Ethernet clusters the scaling typically breaks down as soon as more than two computer nodes are involved, limiting the achievable speedup to about 3 relative to a single-CPU run. With the LAM MPI implementation, we identify the main scaling bottleneck as the all-to-all communication that is required in every time step. During such an all-to-all communication step, a large number of messages floods the network, and as a result many TCP packets are lost. We show that Ethernet flow control prevents network congestion and leads to substantial scaling improvements; for 16 CPUs, for example, a speedup of 11 was achieved. For larger node counts, however, this mechanism also fails. Using an optimized all-to-all routine that sends the data in an ordered fashion, we show that packet loss can be prevented entirely for any number of multi-CPU nodes. GROMACS scaling thus improves dramatically, even on switches that lack flow control. In addition, for the common HP ProCurve 2848 switch we find that optimum all-to-all performance depends critically on how the nodes are connected to the switch's ports. This is also demonstrated for the Car-Parrinello MD code. © 2007 Wiley Periodicals, Inc. J Comput Chem, 2007
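The congestion problem and the ordered fix can be made concrete with a small sketch. In a naive all-to-all, every node sends to every other node at once, so an N-node exchange injects on the order of N^2 nearly simultaneous messages into the switch, whose buffers overflow and drop TCP packets. An ordered scheme instead serializes the exchange into N phases in which every node has exactly one send partner and one receive partner. The C/MPI code below is a minimal illustration of such a phase-shifted exchange, not the actual routine from the paper; the function name ordered_alltoall and the 1 MiB block size are our own choices for the sketch.

/* Minimal sketch of an ordered (phase-shifted) all-to-all exchange.
 * Illustrative only -- not the routine from the paper. The idea: in
 * phase p, rank r sends its block for rank (r+p) mod N and receives
 * from rank (r-p+N) mod N, so each node has exactly one send and one
 * receive partner per phase instead of flooding the switch at once.
 * Compile with: mpicc -o alltoall alltoall.c
 */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* sendbuf and recvbuf each hold nprocs contiguous blocks of
 * blocksize bytes; block i is the data for/from rank i. */
static void ordered_alltoall(const char *sendbuf, char *recvbuf,
                             int blocksize, MPI_Comm comm)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    for (int phase = 0; phase < nprocs; phase++) {
        int dst = (rank + phase) % nprocs;           /* send target  */
        int src = (rank - phase + nprocs) % nprocs;  /* recv source  */
        /* dst expects our block in the same phase in which we expect
         * src's block, so a single Sendrecv per phase cannot deadlock. */
        MPI_Sendrecv(sendbuf + (size_t)dst * blocksize, blocksize,
                     MPI_BYTE, dst, 0,
                     recvbuf + (size_t)src * blocksize, blocksize,
                     MPI_BYTE, src, 0,
                     comm, MPI_STATUS_IGNORE);
    }
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int blocksize = 1 << 20;  /* 1 MiB per destination block */
    char *sendbuf = malloc((size_t)nprocs * blocksize);
    char *recvbuf = malloc((size_t)nprocs * blocksize);
    if (!sendbuf || !recvbuf)
        MPI_Abort(MPI_COMM_WORLD, 1);
    memset(sendbuf, rank, (size_t)nprocs * blocksize);

    ordered_alltoall(sendbuf, recvbuf, blocksize, MPI_COMM_WORLD);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}

Because phase p pairs rank r with ranks (r+p) mod N and (r-p+N) mod N, each node injects a single message per phase, which is the property that keeps the switch buffers from overflowing even on hardware without 802.3x flow control.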
