A massively parallel tensor contraction framework for coupled-cluster computations

Precise calculation of molecular electronic wavefunctions by methods such as coupled-cluster requires the computation of tensor contractions, whose cost scales polynomially with the system and basis set sizes. Each contraction may be executed via matrix multiplication on a properly ordered and structured tensor. However, data transpositions are often needed to reorder the tensors for each contraction. Writing and optimizing distributed-memory kernels for each transposition and contraction is tedious, since the number of contractions scales combinatorially with the number of tensor indices. We present a distributed-memory numerical library, the Cyclops Tensor Framework (CTF), that automatically manages tensor blocking and redistribution to perform any user-specified contraction. CTF serves as the distributed-memory contraction engine in Aquarius, a new program designed for high-accuracy, massively parallel quantum chemical computations. Aquarius implements a range of coupled-cluster and related methods, such as CCSD and CCSDT, by writing the equations on top of a C++ templated domain-specific language (DSL). This DSL calls CTF directly to manage the data and perform the contractions. Our CCSD and CCSDT implementations achieve high parallel scalability on the BlueGene/Q and Cray XC30 supercomputer architectures, showing that accurate electronic structure calculations can be carried out effectively on top of general distributed-memory tensor primitives.
We introduce Cyclops Tensor Framework (CTF), a distributed-memory library for tensor contractions. CTF is able to perform tensor decomposition, redistribution, and contraction at runtime. CTF enables the expression of massively-parallel coupled-cluster methods via a concise tensor contraction interface. The quantum chemistry software suite Aquarius employs CTF to execute two coupled-cluster methods: CCSD and CCSDT. The Aquarius CCSD and CCSDT codes scale well on BlueGene/Q and Cray XC30, comparing favorably to NWChem.
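
The abstract's point that a contraction reduces to a matrix multiplication once the tensors are suitably reordered can be illustrated with a small serial sketch. This is not CTF's interface or algorithm: the function names (`transpose`, `matmul`, `contract`, `contract_direct`), the flat row-major storage, and the example contraction C[p,i] = sum over e,m of A[e,p,m] * B[i,m,e] are all assumptions chosen for illustration, and nothing here is distributed or blocked.

```python
from itertools import product


def transpose(data, shape, perm):
    """Reorder the axes of a flat row-major tensor; new axis k is old axis perm[k]."""
    strides = [1] * len(shape)
    for k in range(len(shape) - 2, -1, -1):
        strides[k] = strides[k + 1] * shape[k + 1]
    new_shape = [shape[p] for p in perm]
    new_data = []
    for idx in product(*(range(n) for n in new_shape)):
        offset = sum(idx[k] * strides[perm[k]] for k in range(len(perm)))
        new_data.append(data[offset])
    return new_data, new_shape


def matmul(x, y, m, k, n):
    """Multiply a row-major m x k matrix by a row-major k x n matrix."""
    return [sum(x[r * k + t] * y[t * n + c] for t in range(k))
            for r in range(m) for c in range(n)]


def contract(a, a_shape, b, b_shape):
    """C[p,i] = sum_{e,m} A[e,p,m] * B[i,m,e] via transpose + matrix multiply."""
    ne, n_p, nm = a_shape
    ni = b_shape[0]
    # Move the free index of A to the front: A -> (p, e, m), a n_p x (ne*nm) matrix.
    a_mat, _ = transpose(a, a_shape, (1, 0, 2))
    # Move the contracted indices of B to the front, in the same order: B -> (e, m, i).
    b_mat, _ = transpose(b, b_shape, (2, 1, 0))
    return matmul(a_mat, b_mat, n_p, ne * nm, ni)


def contract_direct(a, a_shape, b, b_shape):
    """Reference: evaluate the same contraction by explicit summation."""
    ne, n_p, nm = a_shape
    ni = b_shape[0]
    return [sum(a[(e * n_p + p) * nm + m] * b[(i * nm + m) * ne + e]
                for e in range(ne) for m in range(nm))
            for p in range(n_p) for i in range(ni)]


# Small integer example: A has shape (e, p, m) = (2, 3, 2), B has shape (i, m, e) = (4, 2, 2).
a_shape, b_shape = (2, 3, 2), (4, 2, 2)
a = list(range(12))
b = list(range(16))
c = contract(a, a_shape, b, b_shape)
print(c == contract_direct(a, a_shape, b, b_shape))
```

In this sketch, the two `transpose` calls stand in for the data reordering the abstract describes; each distinct index pattern needs its own permutation, which is why hand-writing such kernels does not scale to many contractions.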
