Synthesis of High-Performance Parallel Programs for a Class of ab Initio Quantum Chemistry Models

This paper provides an overview of a program synthesis system for a class of quantum chemistry computations. These computations are expressible as a set of tensor contractions and arise in electronic structure modeling. The input to the system is a high-level specification of the computation, from which the system synthesizes high-performance parallel code tailored to the characteristics of the target architecture. Several components of the synthesis system are described, with a focus on the performance optimization issues they address.
