Paper information - Synthesis of High-Performance Parallel Programs for a Class of ab Initio Quantum Chemistry Models

This paper provides an overview of a program synthesis system for a class of quantum chemistry computations. These computations arise in electronic structure modeling and can be expressed as a set of tensor contractions. The input to the system is a high-level specification of the computation, from which the system synthesizes high-performance parallel code tailored to the characteristics of the target architecture. The paper describes several components of the synthesis system, focusing on the performance optimization issues that each one addresses.
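To make the term "tensor contraction" concrete, the following is a minimal sketch of one such operation, S[a][b] = sum over c, d of A[a][c][d] * B[c][d][b], written as plain nested loops. The function name `contract` and the choice of indices are illustrative assumptions; this is neither the system's input notation nor the code it generates.

```python
from itertools import product


def contract(a, b):
    """Compute S[i][j] = sum over c, d of a[i][c][d] * b[c][d][j].

    `a` is a nested list of shape (n_i, n_c, n_d) and `b` is a nested
    list of shape (n_c, n_d, n_j). Hypothetical helper for illustration.
    """
    n_i, n_c, n_d = len(a), len(a[0]), len(a[0][0])
    n_j = len(b[0][0])
    s = [[0] * n_j for _ in range(n_i)]
    for i, j in product(range(n_i), range(n_j)):
        # Sum over the two contracted indices c and d.
        s[i][j] = sum(
            a[i][c][d] * b[c][d][j]
            for c, d in product(range(n_c), range(n_d))
        )
    return s


# Small usage example: a has shape (1, 2, 2), b has shape (2, 2, 1).
a = [[[1, 2], [3, 4]]]
b = [[[1], [1]], [[1], [1]]]
result = contract(a, b)
```

A direct loop nest like this leaves loop order, data layout, and parallelization unspecified. Those are the kinds of choices the abstract says the synthesis system makes for the target architecture.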
