Unleashing GPU acceleration for symmetric band linear algebra kernels and model reduction

Linear algebra operations arise in a myriad of scientific and engineering applications and, therefore, their optimization is the target of a significant number of high-performance computing research efforts. In particular, matrix multiplication and the solution of linear systems are two key problems with efficient implementations (or kernels) for a variety of high-performance parallel architectures. For these specific problems, leveraging the structure of the associated matrices often yields remarkable savings in time and memory, as is the case, e.g., for symmetric band problems. In this work, we exploit the ample hardware concurrency of many-core graphics processors (GPUs) to accelerate the solution of symmetric positive definite band linear systems, introducing highly tuned versions of the corresponding LAPACK routines. The experimental results with the new GPU kernels reveal significant reductions in execution time compared with the tuned implementations of the same operations provided by Intel's MKL. In addition, we evaluate the performance of the GPU kernels when applied to the solution of model order reduction problems and the associated matrix equations.
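
To make the accelerated operation concrete, the following is a minimal CPU-side sketch of the LAPACK factorization/solve pair for symmetric positive definite band systems, dpbtrf (band Cholesky factorization) and dpbtrs (band triangular solves), i.e., the routines whose GPU-tuned counterparts the paper introduces. It assumes the LAPACKE C interface is available (link with -llapacke); the 1D Laplacian test matrix and unit right-hand side are illustrative choices, not data from the paper.

```c
/* Sketch: solve A*x = b for an SPD band matrix A via LAPACK's
 * band Cholesky routines. Assumes LAPACKE is installed. */
#include <stdio.h>
#include <lapacke.h>

int main(void) {
    const lapack_int n = 5, kd = 1, nrhs = 1, ldab = 2; /* ldab = kd + 1 */
    double ab[2 * 5]; /* lower band storage: (kd+1) x n, column-major */
    double b[5];      /* right-hand side, overwritten with the solution */

    /* Illustrative SPD test matrix: the tridiagonal 1D Laplacian.
     * Band storage keeps only the diagonal (row 0 of ab) and the
     * subdiagonal (row 1), i.e., kd+1 values per column instead of n. */
    for (lapack_int j = 0; j < n; ++j) {
        ab[0 + j * ldab] = 2.0;                      /* A(j,j)   */
        ab[1 + j * ldab] = (j < n - 1) ? -1.0 : 0.0; /* A(j+1,j) */
        b[j] = 1.0;
    }

    /* Band Cholesky factorization A = L*L^T, exploiting the structure. */
    lapack_int info = LAPACKE_dpbtrf(LAPACK_COL_MAJOR, 'L', n, kd, ab, ldab);
    if (info != 0) { fprintf(stderr, "dpbtrf failed: %d\n", (int)info); return 1; }

    /* Forward/backward band triangular solves reuse the factor in ab. */
    info = LAPACKE_dpbtrs(LAPACK_COL_MAJOR, 'L', n, kd, nrhs, ab, ldab, b, n);
    if (info != 0) { fprintf(stderr, "dpbtrs failed: %d\n", (int)info); return 1; }

    for (lapack_int i = 0; i < n; ++i)
        printf("x[%d] = %g\n", (int)i, b[i]);
    return 0;
}
```

The band storage above is where the memory savings come from: only kd+1 entries per column are stored instead of n, and the factorization costs O(n kd^2) flops rather than the O(n^3) of the dense case. In the model reduction setting, such solvers typically appear inside iterative schemes for large matrix equations, e.g., low-rank ADI methods for Lyapunov equations of the form A P + P A^T + B B^T = 0, where each iteration requires the solution of a shifted band linear system.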
