Reducing the amount of out‐of‐core data access for GPU‐accelerated randomized SVD

We propose two acceleration methods, Fused and Gram, that reduce out‐of‐core data access when performing randomized singular value decomposition (RSVD) on graphics processing units (GPUs). Out‐of‐core data here are data that are too large to fit into GPU memory at once. Both methods accelerate GPU‐enabled RSVD using the following three schemes: (1) a highly tuned general matrix‐matrix multiplication (GEMM) scheme for processing out‐of‐core data on GPUs; (2) a data‐access reduction scheme based on one‐dimensional data partitioning; and (3) a first‐in, first‐out scheme that reduces CPU‐GPU data transfer using reverse iteration. The Fused method further reduces the amount of out‐of‐core data access by merging two GEMM operations into a single operation. By contrast, the Gram method reduces both in‐core and out‐of‐core data access by explicitly forming the Gram matrix. In our experiments, the Fused and Gram methods improved RSVD performance by up to 1.7× and 5.2×, respectively, compared with a straightforward method that deploys schemes (1) and (2) on the GPU. In addition, we present a case study of deploying the Gram method to accelerate robust principal component analysis, a convex optimization problem in machine learning.
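The two ideas can be illustrated with a minimal CPU-only NumPy sketch. It is not the authors' GPU implementation, and it reflects only one plausible reading of the abstract: the matrix is split into row blocks (a one‐dimensional partition), each block stands in for one CPU‐to‐GPU transfer, "Fused" is taken to mean computing A^T(AQ) in a single pass over the blocks, and "Gram" is taken to mean accumulating G = A^T A once and iterating on G alone. The function names, the parameters `block_rows`, `oversample`, and `power_iters`, and the Rayleigh-Ritz step used to extract the factors are all assumptions made for illustration.

```python
import numpy as np


def fused_product(A, Q, block_rows):
    """Return A.T @ (A @ Q), reading each row block of A only once.

    Assumption: this models the 'Fused' idea by doing both multiplications
    on a block while it is resident, instead of streaming A twice.
    """
    Z = np.zeros((A.shape[1], Q.shape[1]))
    for start in range(0, A.shape[0], block_rows):
        blk = A[start:start + block_rows]  # stand-in for one host-to-device copy
        Z += blk.T @ (blk @ Q)
    return Z


def gram_blocked(A, block_rows):
    """Accumulate the Gram matrix G = A.T @ A from row blocks of A."""
    n = A.shape[1]
    G = np.zeros((n, n))
    for start in range(0, A.shape[0], block_rows):
        blk = A[start:start + block_rows]
        G += blk.T @ blk
    return G


def rsvd_gram(A, k, oversample=10, power_iters=2, block_rows=1024, seed=0):
    """Rank-k randomized SVD that iterates on the small n-by-n Gram matrix.

    Hypothetical sketch: A is streamed once to build G and once more to
    recover U. Singular values are assumed to be nonzero for the chosen k.
    """
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    ell = min(n, k + oversample)

    G = gram_blocked(A, block_rows)

    # Power iteration on G; no further access to A is needed here.
    Y, _ = np.linalg.qr(rng.standard_normal((n, ell)))
    for _ in range(power_iters):
        Y, _ = np.linalg.qr(G @ Y)

    # Rayleigh-Ritz on the sampled subspace: eigenvalues of Y.T G Y
    # approximate the squared singular values of A.
    w, W = np.linalg.eigh(Y.T @ G @ Y)
    idx = np.argsort(w)[::-1][:k]
    s = np.sqrt(np.clip(w[idx], 0.0, None))
    V = Y @ W[:, idx]

    # One more pass over A recovers the left singular vectors.
    U = (A @ V) / s
    return U, s, V.T
```

A call such as `rsvd_gram(A, k=5, block_rows=4096)` returns factors `U`, `s`, `Vt` with `A ≈ (U * s) @ Vt`. Forming G squares the condition number of A, so this sketch trades some accuracy in the small singular values for fewer passes over the large matrix.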
