Optimizing the Four-Index Integral Transform Using Data Movement Lower Bounds Analysis

The four-index integral transform is a fundamental and computationally demanding calculation used in many computational chemistry suites such as NWChem. It transforms a four-dimensional tensor from one basis to another. This transformation is most efficiently implemented as a sequence of four tensor contractions, each of which contracts a four-dimensional tensor with a two-dimensional transformation matrix. Because the intermediate and final tensors in this sequence have differing degrees of permutation symmetry, the intermediate tensors are much larger than the final tensor, which limits the number of electronic states in the systems that can be modeled. Loop fusion, in conjunction with tiling, can be very effective in reducing both the total space requirement and data movement. However, the large number of possible choices for loop fusion and tiling, and for data/computation distribution across a parallel system, makes it challenging to develop an optimized parallel implementation of the four-index integral transform. We develop a novel approach to this problem, using lower bounds modeling of data movement complexity. We establish relationships between the available aggregate physical memory in a parallel computer system and ineffective fusion configurations. These relationships allow the ineffective configurations to be pruned, the effective choices to be identified, and optimality criteria to be characterized. This work has resulted in a significantly improved implementation of the four-index transform that delivers higher performance and can model larger electronic systems than the current implementation in the NWChem quantum chemistry software suite.
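
To make the "sequence of four tensor contractions" concrete, the following is a minimal sketch in plain Python, not the NWChem implementation or the optimized scheme developed in this work. The names `A` (input tensor), `C` (transformation matrix), and the helper functions are assumptions introduced for illustration only. The sketch ignores permutation symmetry, loop fusion, tiling, and parallel distribution; it only checks that four successive one-index contractions, each costing O(n^5), reproduce the direct O(n^8) transform.

```python
import itertools
import random

def transform_direct(A, C, n):
    """Direct transform, O(n^8):
    B[a,b,c,d] = sum over p,q,r,s of C[p][a]*C[q][b]*C[r][c]*C[s][d]*A[p,q,r,s]."""
    idx = range(n)
    B = {}
    for a, b, c, d in itertools.product(idx, repeat=4):
        total = 0.0
        for p, q, r, s in itertools.product(idx, repeat=4):
            total += C[p][a] * C[q][b] * C[r][c] * C[s][d] * A[p, q, r, s]
        B[a, b, c, d] = total
    return B

def contract_first_index(T, C, n):
    """One contraction, O(n^5): contract the first index of T with C and
    place the new index last, so out[q,r,s,a] = sum_p C[p][a] * T[p,q,r,s]."""
    idx = range(n)
    out = {}
    for q, r, s, a in itertools.product(idx, repeat=4):
        out[q, r, s, a] = sum(C[p][a] * T[p, q, r, s] for p in idx)
    return out

def transform_sequential(A, C, n):
    """Four successive contractions. Because each step rotates the new index
    to the end, the index order after four steps is (a, b, c, d)."""
    T = A
    for _ in range(4):
        T = contract_first_index(T, C, n)  # each T is a full n^4 intermediate
    return T

# Small hypothetical problem size so the O(n^8) reference stays cheap.
n = 3
random.seed(0)
A = {key: random.random() for key in itertools.product(range(n), repeat=4)}
C = [[random.random() for _ in range(n)] for _ in range(n)]

B_direct = transform_direct(A, C, n)
B_seq = transform_sequential(A, C, n)
max_diff = max(abs(B_direct[k] - B_seq[k]) for k in B_direct)
print("max abs difference:", max_diff)
```

In this unfused form every intermediate `T` is stored in full, which is the space cost that fusion and tiling aim to reduce.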
