High-Performance Out-of-core Block Randomized Singular Value Decomposition on GPU

Fast computation of the singular value decomposition (SVD) is of great interest in many machine learning tasks. Recently, SVD methods based on randomized linear algebra have shown significant speedups in this regime. This paper further accelerates the computation by harnessing a modern computing architecture, the graphics processing unit (GPU), with the goal of processing large-scale data that may not fit in GPU memory. The result is a new block randomized algorithm that fully utilizes the power of GPUs and efficiently processes large-scale data in an out-of-core fashion. Our experiments show that the proposed block randomized SVD (BRSVD) method outperforms existing randomized SVD methods in speed while retaining the same accuracy. We also apply it to convex robust principal component analysis, where it yields significant speedups in computer vision applications.
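To make the out-of-core idea concrete, the following is a minimal CPU sketch of a generic randomized SVD that touches the data matrix only one row block at a time. It is not the BRSVD algorithm itself, whose details are not given here; it follows the standard random-projection range-finder scheme, and the function name `blocked_randomized_svd` and the parameters `oversample` and `block_rows` are assumptions made for illustration. On a GPU, each row block would be copied to device memory before the matrix products.

```python
import numpy as np


def blocked_randomized_svd(A, rank, oversample=10, block_rows=256, seed=0):
    """Approximate rank-`rank` SVD of A, reading A one row block at a time.

    Illustrative sketch only: a generic randomized range finder with
    blockwise streaming, not the BRSVD method described in the abstract.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    k = min(rank + oversample, m, n)

    # Random test matrix used to sample the column space of A.
    omega = rng.standard_normal((n, k))

    # Pass 1: Y = A @ omega, computed block by block.
    Y = np.empty((m, k))
    for start in range(0, m, block_rows):
        stop = min(start + block_rows, m)
        Y[start:stop] = A[start:stop] @ omega

    # Orthonormal basis for the sampled range (Q is m x k).
    Q, _ = np.linalg.qr(Y)

    # Pass 2: B = Q.T @ A, accumulated block by block.
    B = np.zeros((k, n))
    for start in range(0, m, block_rows):
        stop = min(start + block_rows, m)
        B += Q[start:stop].T @ A[start:stop]

    # Small dense SVD of the k x n projected matrix.
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ Ub
    return U[:, :rank], s[:rank], Vt[:rank]


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Synthetic matrix of exact rank 5 (hypothetical test data).
    A = rng.standard_normal((1000, 5)) @ rng.standard_normal((5, 200))
    U, s, Vt = blocked_randomized_svd(A, rank=5, block_rows=128)
    rel_err = np.linalg.norm(A - (U * s) @ Vt) / np.linalg.norm(A)
    print("relative reconstruction error:", rel_err)
```

Only the `k`-column matrices `Y`, `Q`, and `B` need to stay resident between blocks, so `A` could equally be a memory-mapped array (for example one opened with `np.load(..., mmap_mode="r")`) that is larger than the available memory.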
