Performance characterization of global address space applications: a case study with NWChem

The use of global address space languages and one‐sided communication for complex applications is gaining attention in the parallel computing community. However, the lack of good evaluative methods for observing multiple levels of performance makes it difficult to isolate the causes of performance deficiencies and to understand the fundamental limitations of system and application design for future improvement. NWChem is a popular computational chemistry package that depends on the Global Arrays/Aggregate Remote Memory Copy Interface suite for partitioned global address space functionality to deliver high‐end molecular modeling capabilities. A workload characterization methodology was developed to support NWChem performance engineering on large‐scale parallel platforms. The research involved both the integration of performance instrumentation and measurement into the NWChem software and the analysis of one‐sided communication performance in the context of NWChem workloads. Scaling studies were conducted for NWChem on Blue Gene/P and on two large‐scale clusters using different generations of InfiniBand interconnects and x86 processors. The performance analysis and results show how subtle changes in the runtime parameters related to the communication subsystem can have a significant impact on performance behavior. The tool has successfully identified several algorithmic bottlenecks, which are already being tackled by computational chemists to improve NWChem performance. Copyright © 2011 John Wiley & Sons, Ltd.
