A Communication-Optimal Framework for Contracting Distributed Tensors

Tensor contractions are extremely compute intensive generalized matrix multiplication operations encountered in many computational science fields, such as quantum chemistry and nuclear physics. Unlike distributed matrix multiplication, which has been extensively studied, limited work has been done in understanding distributed tensor contractions. In this paper, we characterize distributed tensor contraction algorithms on torus networks. We develop a framework with three fundamental communication operators to generate communication-efficient contraction algorithms for arbitrary tensor contractions. We show that for a given amount of memory per processor, the framework is communication optimal for all tensor contractions. We demonstrate performance and scalability of the framework on up to 262,144 cores on a Blue Gene/Q supercomputer.

[1]  Sriram Krishnamoorthy,et al.  Scalable implementations of accurate excited-state coupled cluster theories: Application of high-level methods to porphyrin-based systems , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[2]  J. Ramanujam,et al.  Loop optimization for a class of memory-constrained computations , 2001, ICS '01.

[3]  Barbara M. Chapman,et al.  Performance Analysis of the NWChem TCE for Different Communication Patterns , 2013, PMBS@SC.

[4]  Sriram Krishnamoorthy,et al.  A framework for load balancing of Tensor Contraction expressions via dynamic task partitioning , 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[5]  Sriram Krishnamoorthy,et al.  Supporting the Global Arrays PGAS Model Using MPI One-Sided Communication , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.

[6]  David E. Bernholdt,et al.  Automated Operation Minimization of Tensor Contraction Expressions in Electronic Structure Calculations , 2005, International Conference on Computational Science.

[7]  Jarek Nieplocha,et al.  Advances, Applications and Performance of the Global Arrays Shared Memory Programming Toolkit , 2006, Int. J. High Perform. Comput. Appl..

[8]  David E. Bernholdt,et al.  Synthesis of High-Performance Parallel Programs for a Class of ab Initio Quantum Chemistry Models , 2005, Proceedings of the IEEE.

[9]  Robert J. Harrison,et al.  Shared Memory Programming in Metacomputing Environments: The Global Array Approach , 1997, The Journal of Supercomputing.

[10]  Hyuk-Jae Lee,et al.  Generalized Cannon's algorithm for parallel matrix multiplication , 1997, ICS '97.

[11]  James Demmel,et al.  Cyclops Tensor Framework: Reducing Communication and Eliminating Load Imbalance in Massively Parallel Contractions , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.

[12]  J. Ramanujam,et al.  Global communication optimization for tensor contraction expressions under memory constraints , 2003, Proceedings International Parallel and Distributed Processing Symposium.

[13]  Robert A. van de Geijn,et al.  SUMMA: scalable universal matrix multiplication algorithm , 1995, Concurr. Pract. Exp..

[14]  James Demmel,et al.  Minimizing Communication in Numerical Linear Algebra , 2009, SIAM J. Matrix Anal. Appl..

[15]  R. Bartlett,et al.  Coupled-cluster theory in quantum chemistry , 2007 .

[16]  J. Ramanujam,et al.  Performance modeling and optimization of parallel out-of-core tensor contractions , 2005, PPoPP.

[17]  James Demmel,et al.  Communication-Optimal Parallel 2.5D Matrix Multiplication and LU Factorization Algorithms , 2011, Euro-Par.

[18]  Mark S. Gordon,et al.  Chapter 41 – Advances in electronic structure theory: GAMESS a decade later , 2005 .

[19]  Beverly A. Sanders,et al.  Software design of ACES III with the super instruction architecture , 2011 .

[20]  Tjerk P. Straatsma,et al.  NWChem: A comprehensive and scalable open-source solution for large scale molecular simulations , 2010, Comput. Phys. Commun..

[21]  James Demmel,et al.  Improving communication performance in dense linear algebra via topology aware collectives , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[22]  David E. Bernholdt,et al.  Space-time trade-off optimization for a class of electronic structure calculations , 2002, PLDI '02.

[23]  T. Crawford,et al.  An Introduction to Coupled Cluster Theory for Computational Chemists , 2007 .

[24]  Ramesh C. Agarwal,et al.  A three-dimensional approach to parallel matrix multiplication , 1995, IBM J. Res. Dev..

[25]  Robert A. van de Geijn,et al.  SUMMA: Scalable Universal Matrix Multiplication Algorithm , 1995 .

[26]  Dror Irony,et al.  Communication lower bounds for distributed-memory matrix multiplication , 2004, J. Parallel Distributed Comput..

[27]  James Demmel,et al.  Communication-Avoiding Parallel Strassen: Implementation and performance , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.