phiGEMM: A CPU-GPU Library for Porting Quantum ESPRESSO on Hybrid Systems

GPU computing has changed HPC by bringing supercomputer-class performance to the desktop. Attractive price, performance, and power characteristics allow multiple GPUs to be installed in both desktop machines and supercomputer nodes. For some problems, hybrid combinations of multiple GPUs and CPU resources achieve excellent performance and scalability. This paper presents the acceleration of the open-source Quantum ESPRESSO package with the freely available phiGEMM library, and discusses the parallel implementation and scaling of the phiGEMM matrix-matrix multiplication. The library can be called from applications through all standard GEMM interfaces and performs matrix-matrix multiplications using one or more GPUs together with the host multi-core processor. An 8.9-times speedup in overall run time is reported for a PWscf calculation on the representative AUSURF112 benchmark. Multi-GPU scaling and performance for 3D FFTs are also discussed.
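
The abstract does not say how the work of a single GEMM call is divided between the GPUs and the host. As a minimal sketch of one common approach, the Python below splits the columns of B into two blocks, computes each partial product separately, and joins the results. The function names (`matmul`, `split_gemm`) and the `gpu_fraction` parameter are hypothetical and are not part of the phiGEMM API; both blocks are computed here on the CPU, so only the partitioning is illustrated, not the actual offload.

```python
def matmul(a, b):
    """Plain product of an m x k and a k x n matrix stored as lists of rows."""
    k, n = len(b), len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(k)) for j in range(n)]
            for row in a]


def split_gemm(a, b, gpu_fraction=0.75):
    """Compute a @ b by splitting the columns of b into two blocks.

    Assumption for illustration: the first block stands in for the share a
    GPU would receive and the second for the share kept on the host CPU.
    Both blocks are computed with the same Python routine here.
    """
    n = len(b[0])
    # Number of columns assigned to the "GPU" block, clamped to [0, n].
    n_gpu = max(0, min(n, int(round(n * gpu_fraction))))
    b_gpu = [row[:n_gpu] for row in b]
    b_cpu = [row[n_gpu:] for row in b]
    c_gpu = matmul(a, b_gpu)  # would run on the accelerator
    c_cpu = matmul(a, b_cpu)  # would run concurrently on the host
    # Concatenate the two column blocks row by row to rebuild C.
    return [g + c for g, c in zip(c_gpu, c_cpu)]


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6, 7], [8, 9, 10]]
    print(split_gemm(a, b, gpu_fraction=0.5))
```

In a scheme of this kind the split ratio would be tuned to the relative throughput of the devices, so that both finish their block at roughly the same time.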
