Acceleration of Streamed Tensor Contraction Expressions on GPGPU-Based Clusters

Tensor contractions are generalized multidimensional matrix multiplication operations that occur widely in quantum chemistry. Efficient execution of tensor contractions on GPUs requires addressing several challenges, including index permutation and small dimension sizes that reduce thread block utilization. In this paper, we present our approach to automatically generating CUDA code that executes tensor contractions on GPUs, including management of data movement between CPU and GPU. GPU-enabled code is generated for the most expensive contractions in CCSD(T), a key coupled cluster method, and incorporated into NWChem, a popular computational chemistry suite. We demonstrate a speedup of over 8.4x using one core per node and over 2.6x when utilizing the entire system with a hybrid CPU+GPU solution with 2 GPUs and 5 cores. Finally, we analyze the implementation behavior on future GPU systems.
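To make the phrase "generalized multidimensional matrix multiplication" concrete, the following is a minimal sketch in plain Python, not the CUDA code generated in this work. It evaluates one illustrative contraction, C[a][b][c] = sum over k of A[a][k][c] * B[k][b], first with explicit loops and then as an index permutation followed by an ordinary matrix multiply. The index names, the tensor shapes, and the function names are assumptions chosen for illustration only.

```python
from itertools import product


def contract_direct(A, B, na, nb, nc, nk):
    """C[a][b][c] = sum_k A[a][k][c] * B[k][b], using explicit nested loops."""
    C = [[[0] * nc for _ in range(nb)] for _ in range(na)]
    for a, b, c, k in product(range(na), range(nb), range(nc), range(nk)):
        C[a][b][c] += A[a][k][c] * B[k][b]
    return C


def matmul(X, Y):
    """Plain dense matrix multiply on lists of lists."""
    inner, cols = len(Y), len(Y[0])
    return [[sum(row[i] * Y[i][c] for i in range(inner)) for c in range(cols)]
            for row in X]


def contract_via_matmul(A, B, na, nb, nc, nk):
    """Same contraction as: permute A, multiply matrices, permute the result."""
    # Permute A[a][k][c] into a matrix whose rows are (a, c) pairs and
    # whose columns are the contracted index k.
    A_mat = [[A[a][k][c] for k in range(nk)]
             for a in range(na) for c in range(nc)]
    # (na*nc x nk) times (nk x nb): rows are (a, c) pairs, columns are b.
    M = matmul(A_mat, B)
    # Permute the product back into the C[a][b][c] layout.
    return [[[M[a * nc + c][b] for c in range(nc)] for b in range(nb)]
            for a in range(na)]
```

The two functions return identical results. The second form shows where the index permutation cost comes from: the contracted index k sits in the middle of A, so A must be rearranged before a standard matrix multiply applies. When extents such as nk or nb are small, the resulting matrices are thin, which is one way small dimension sizes can leave GPU thread blocks underutilized.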
