A framework for load balancing of Tensor Contraction expressions via dynamic task partitioning

In this paper, we introduce Dynamic Load-balanced Tensor Contractions (DLTC), a domain-specific library for efficient task-parallel execution of tensor contraction expressions, a class of computation encountered in quantum chemistry and physics. Our framework decomposes each contraction into smaller units of work (tasks), represented by an abstraction referred to as iterators. We exploit an extra level of parallelism by executing tasks from independent contractions concurrently through a dynamic load-balancing runtime. We demonstrate improved performance, scalability, and flexibility for the computation of tensor contraction expressions on parallel computers using examples from Coupled Cluster (CC) methods.
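The decomposition described above can be illustrated with a minimal sketch. It is not DLTC's API: the names `contraction_tasks`, `run_task`, `execute`, and `zeros` are hypothetical, the contraction is restricted to a plain matrix product C[i][j] += sum_k A[i][k] * B[k][j], and a shared in-process queue drained by threads stands in for the dynamic load balancing runtime. Each contraction exposes an iterator over tile-level tasks, and tasks from independent contractions are placed in one pool so that idle workers pick up whatever work remains.

```python
import queue
import threading


def zeros(rows, cols):
    """Return a rows x cols matrix of zeros as nested lists."""
    return [[0] * cols for _ in range(rows)]


def contraction_tasks(a, b, c, tile):
    """Iterate over tile-level tasks of c[i][j] += sum_k a[i][k] * b[k][j].

    Hypothetical stand-in for the paper's iterator abstraction: each task
    covers one disjoint tile of the output, so tasks never write the same cell.
    """
    rows, cols = len(a), len(b[0])
    for i0 in range(0, rows, tile):
        for j0 in range(0, cols, tile):
            yield (a, b, c, i0, min(i0 + tile, rows), j0, min(j0 + tile, cols))


def run_task(task):
    """Compute one output tile of a contraction."""
    a, b, c, i0, i1, j0, j1 = task
    inner = len(b)
    for i in range(i0, i1):
        for j in range(j0, j1):
            c[i][j] += sum(a[i][k] * b[k][j] for k in range(inner))


def execute(task_iterators, num_workers=4):
    """Run tasks from several independent contractions out of one shared pool.

    Workers pull the next available task regardless of which contraction it
    belongs to, which is the dynamic load balancing idea in miniature.
    """
    pool = queue.Queue()
    for tasks in task_iterators:
        for task in tasks:
            pool.put(task)

    def worker():
        while True:
            try:
                task = pool.get_nowait()
            except queue.Empty:
                return  # no work left for this worker
            run_task(task)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


if __name__ == "__main__":
    # Two independent contractions of different sizes share one task pool.
    a1, b1, c1 = [[1, 2], [3, 4]], [[5, 6], [7, 8]], zeros(2, 2)
    a2 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    b2 = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    c2 = zeros(3, 3)
    execute([contraction_tasks(a1, b1, c1, 1), contraction_tasks(a2, b2, c2, 2)])
    print(c1)
    print(c2)
```

Python threads are used here only to show the scheduling pattern; they do not provide real parallel speedup for this arithmetic. The sketch also assumes the contractions are mutually independent, as the abstract requires for concurrent execution.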
