CAST: Contraction Algorithm for Symmetric Tensors

Tensor contractions are the most compute-intensive kernels in ab initio computational quantum chemistry and nuclear physics. Symmetries in the tensors make these contractions difficult to load balance and to scale on large distributed systems. In this paper, we develop an efficient and scalable algorithm for contracting symmetric tensors. We introduce a novel approach that avoids data redistribution during the contraction of symmetric tensors while also avoiding redundant storage and maintaining load balance. We present experimental results on two parallel supercomputers for several symmetric contractions that appear in the coupled cluster singles and doubles (CCSD) quantum chemistry method. We also present a novel approach to tensor redistribution that exploits parallel hyperplanes when the initial distribution has replicated dimensions and uses collective broadcasts when the final distribution has replicated dimensions, making redistribution efficient in both cases.
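To make the notion of symmetry and redundant storage concrete, the following is a minimal serial sketch and not the distributed algorithm of the paper. It assumes a symmetric matrix (A[i][j] == A[j][i]) stored as its upper triangle only and contracts it with a vector; the helper names pack_symmetric, packed_index, and contract_packed are hypothetical and chosen for illustration.

```python
def pack_symmetric(full):
    """Keep only the unique entries A[i][j] with i <= j, row by row."""
    n = len(full)
    return [full[i][j] for i in range(n) for j in range(i, n)]


def packed_index(i, j, n):
    """Position of A[i][j] in the packed list, using A[i][j] == A[j][i]."""
    if i > j:
        i, j = j, i
    # Row i of the upper triangle starts after n + (n-1) + ... + (n-i+1) entries.
    return i * n - i * (i - 1) // 2 + (j - i)


def contract_packed(packed, x, n):
    """Compute y[i] = sum_j A[i][j] * x[j], visiting each unique entry once."""
    y = [0.0] * n
    for i in range(n):
        for j in range(i, n):
            a = packed[packed_index(i, j, n)]
            y[i] += a * x[j]
            if i != j:
                # The same stored value also stands in for A[j][i].
                y[j] += a * x[i]
    return y


full = [[1, 2, 3],
        [2, 4, 5],
        [3, 5, 6]]
x = [1, 0, -1]
packed = pack_symmetric(full)        # 6 stored values instead of 9
y = contract_packed(packed, x, 3)
```

Row i of the packed triangle holds n - i entries, so a naive row-wise split across processes gives uneven work; this is a small-scale picture of the load-balance difficulty the abstract refers to.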
