Global communication optimization for tensor contraction expressions under memory constraints

Accurate modeling of the electronic structure of atoms and molecules requires computationally intensive tensor contractions over large multi-dimensional arrays. Computing complex tensor contractions efficiently usually requires generating temporary intermediate arrays. These intermediates can be extremely large, but they can often be generated and consumed in batches through appropriate loop fusion transformations. To optimize the performance of such computations on parallel computers, the total amount of inter-processor communication must be minimized, subject to the memory available on each processor. In this paper, we address the memory-constrained communication minimization problem for this class of computations. Based on a framework that models the relationship between loop fusion and memory usage, we develop an approach that identifies the combination of loop fusion and data partitioning that minimizes inter-processor communication cost without exceeding the per-processor memory limit. The effectiveness of the approach is demonstrated on a computation representative of a component used in quantum chemistry suites.
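As a minimal illustration of how loop fusion shrinks an intermediate array, consider the contraction S[a][b] = sum over i, j of A[a][i] * B[i][j] * C[j][b]. This sketch is not taken from the paper: the function names, the particular contraction, and the way intermediate size is counted are assumptions chosen for illustration, and the sketch covers only the sequential memory effect, not data partitioning or communication cost.

```python
# Illustrative sketch (assumed example, not the paper's algorithm):
# fusing the producer and consumer loops over index `a` reduces the
# intermediate T from a 2-D array T[a][j] to a 1-D batch T_row[j].

def contract_unfused(A, B, C):
    """Compute S with a fully materialized intermediate T[a][j]."""
    na, ni, nj, nb = len(A), len(B), len(C), len(C[0])
    # Producer loop nest: build the whole intermediate first.
    T = [[sum(A[a][i] * B[i][j] for i in range(ni)) for j in range(nj)]
         for a in range(na)]
    # Consumer loop nest: use the intermediate afterwards.
    S = [[sum(T[a][j] * C[j][b] for j in range(nj)) for b in range(nb)]
         for a in range(na)]
    return S, na * nj  # result and number of intermediate elements held


def contract_fused(A, B, C):
    """Compute S with the `a` loops fused; T is produced one row at a time."""
    na, ni, nj, nb = len(A), len(B), len(C), len(C[0])
    S = []
    for a in range(na):
        # Only the slice of T needed for this `a` is alive at any time.
        T_row = [sum(A[a][i] * B[i][j] for i in range(ni)) for j in range(nj)]
        S.append([sum(T_row[j] * C[j][b] for j in range(nj))
                  for b in range(nb)])
    return S, nj  # result and number of intermediate elements held


if __name__ == "__main__":
    A = [[1, 2], [3, 4]]
    B = [[1, 0], [0, 1]]
    C = [[5, 6], [7, 8]]
    print(contract_unfused(A, B, C))
    print(contract_fused(A, B, C))
```

Both functions return the same result; only the number of intermediate elements held at once differs (na * nj versus nj). On a parallel machine, the choice of which loops to fuse interacts with how the arrays are partitioned, which is the trade-off the paper optimizes.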
