论文信息 - H2OPUS-TLR: High Performance Tile Low Rank Symmetric Factorizations using Adaptive Randomized Approximation

H2OPUS-TLR: High Performance Tile Low Rank Symmetric Factorizations using Adaptive Randomized Approximation

Tile low rank (TLR) representations of dense matrices partition them into blocks of roughly uniform size, where each off-diagonal tile is compressed and stored as its own low rank factorization. They offer an attractive representation for many data-sparse dense operators that appear in practical applications, where substantial compression and a much smaller memory footprint can be achieved. TLR matrices are a compromise between the simplicity of a regular perfectly-strided data structure and the optimal complexity of the unbalanced trees of hierarchically low rank matrices, and provide a convenient performance-tuning parameter through their tile size that can be proportioned to take into account the cache size where the tiles reside in the memory hierarchy. Despite their utility however, there are currently no high performance algorithms that can generate their Cholesky and LDLT factorizations and operate on them efficiently, particularly on GPUs. The difficulties in achieving high performance when factoring TLR matrices come from the expensive compression operations that must be performed during the factorization process and the adaptive rank distribution of the tiles that causes an irregular work pattern for the processing cores. In this work, we develop a dynamic batching operation and combine it with batched adaptive randomized approximations to remedy these difficulties and achieve high performance both on GPUs and CPUs. Our implementation attains over 1.2 TFLOP/s in double precision on the V100 GPU, and is limited primarily by the underlying performance of batched GEMM operations. The time-to-solution also shows substantial speedup compared to regular dense factorizations. The Cholesky factorization of covariance matrix of size N = 131K arising in 2D or 3D spatial statistics, for example, can be factored to an accuracy = 10−2 in just a few seconds. We believe the proposed GEMM-centric algorithm allows it to be readily ported to newer hardware such as the tensor cores that are optimized for small GEMM operations. King Abdullah University of Science and Technology, Extreme Computing Research Center, Thuwal, Saudi Arabia

[1] G. Turkiyyah,et al. Hierarchical algorithms on hierarchical architectures , 2020, Philosophical Transactions of the Royal Society A.

[2] Alfredo Buttari,et al. Block Low-Rank Matrices with Shared Bases: Potential and Limitations of the BLR2 Format , 2021, SIAM J. Matrix Anal. Appl..

[3] Eric Darve,et al. Sparse supernodal solver using block low-rank compression: Design, performance and analysis , 2018, J. Comput. Sci..

[4] Owe Axelsson,et al. Diagonally compensated reduction and related preconditioning methods , 1994, Numer. Linear Algebra Appl..

[5] Jian Cao,et al. Exploiting low-rank covariance structures for computing high-dimensional normal and Student-t probabilities , 2020, Statistics and Computing.

[6] Claude-Pierre Jeannerod,et al. Improving the Complexity of Block Low-Rank Factorizations with Fast Matrix Arithmetic , 2019, SIAM J. Matrix Anal. Appl..

[7] Leslie Greengard,et al. Fast Direct Methods for Gaussian Processes , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[8] David E. Keyes,et al. High Performance Multivariate Geospatial Statistics on Manycore Systems , 2021, IEEE Transactions on Parallel and Distributed Systems.

[9] Jack J. Dongarra,et al. Reducing the Amount of Pivoting in Symmetric Indefinite Systems , 2011, PPAM.

[10] Alexander Heinecke,et al. LIBXSMM: Accelerating Small Matrix Multiplications by Runtime Code Generation , 2016, SC16: International Conference for High Performance Computing, Networking, Storage and Analysis.

[11] Nathan Halko,et al. Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions , 2009, SIAM Rev..

[12] Nicholas J. Higham,et al. A New Preconditioner that Exploits Low-Rank Approximations to Factorization Error , 2018, SIAM J. Sci. Comput..

[13] D. Keyes,et al. Hierarchical matrix approximations for space-fractional diffusion equations , 2020, Computer Methods in Applied Mechanics and Engineering.

[14] David E. Keyes,et al. Randomized GPU Algorithms for the Construction of Hierarchical Matrices from Matrix-Vector Operations , 2019, SIAM J. Sci. Comput..

[15] Alfredo Buttari,et al. On the Complexity of the Block Low-Rank Multifrontal Factorization , 2017, SIAM J. Sci. Comput..

[16] Wolfgang Hackbusch,et al. A Sparse Matrix Arithmetic Based on H-Matrices. Part I: Introduction to H-Matrices , 1999, Computing.

[17] Stefano Zampini,et al. PCBDDC: A Class of Robust Dual-Primal Methods in PETSc , 2016, SIAM J. Sci. Comput..

[18] Per-Gunnar Martinsson,et al. A direct solver with O(N) complexity for integral equations on one-dimensional domains , 2011, 1105.5372.

[19] Edmond Chow,et al. Preserving Positive Definiteness in Hierarchically Semiseparable Matrix Approximations , 2018, SIAM J. Matrix Anal. Appl..

[20] William B. March,et al. ASKIT: Approximate Skeletonization Kernel-Independent Treecode in High Dimensions , 2014, SIAM J. Sci. Comput..

[21] David E. Keyes,et al. On the Robustness and Prospects of Adaptive BDDC Methods for Finite Element Discretizations of Elliptic PDEs with High-Contrast Coefficients , 2016, PASC.

[22] S. H. Cheng,et al. A Modified Cholesky Algorithm Based on a Symmetric Indefinite Factorization , 1998, SIAM J. Matrix Anal. Appl..

[23] S. Börm. Efficient Numerical Methods for Non-local Operators , 2010 .

[24] David E. Keyes,et al. Hierarchical Matrix Operations on GPUs: Matrix-Vector Multiplication and Compression , 2019, ACM Trans. Math. Softw..

[25] Sergej Rjasanow,et al. Adaptive Cross Approximation of Dense Matrices , 2000 .

[26] Patrick R. Amestoy,et al. Bridging the Gap Between Flat and Hierarchical Low-Rank Matrix Formats: The Multilevel Block Low-Rank Format , 2019, SIAM J. Sci. Comput..

[27] David E. Keyes,et al. Parallel accelerated cyclic reduction preconditioner for three-dimensional elliptic PDEs with variable coefficients , 2017, J. Comput. Appl. Math..

[28] Per-Gunnar Martinsson,et al. On the Compression of Low Rank Matrices , 2005, SIAM J. Sci. Comput..

[29] Jianlin Xia,et al. Robust Approximate Cholesky Factorization of Rank-Structured Symmetric Positive Definite Matrices , 2010, SIAM J. Matrix Anal. Appl..

[30] Jennifer A. Scott,et al. On Positive Semidefinite Modification Schemes for Incomplete Cholesky Factorization , 2014, SIAM J. Sci. Comput..

[31] Sjsu ScholarWorks,et al. Rank revealing QR factorizations , 2014 .

[32] Pieter Ghysels,et al. A Distributed-Memory Package for Dense Hierarchically Semi-Separable Matrix Computations Using Randomization , 2015, ACM Trans. Math. Softw..

[33] David E. Keyes,et al. Hierarchical Decompositions for the Computation of High-Dimensional Multivariate Normal Probabilities , 2018 .

[34] Patrick R. Amestoy,et al. Performance and Scalability of the Block Low-Rank Multifrontal Factorization on Multicore Architectures , 2019, ACM Trans. Math. Softw..

[35] Yang Liu,et al. Scalable and Memory-Efficient Kernel Ridge Regression , 2020, 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS).

[36] Yang Liu,et al. A Study of Clustering Techniques and Hierarchical Matrix Formats for Kernel Ridge Regression , 2018, 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW).

[37] Jean-Yves L'Excellent,et al. Improving Multifrontal Methods by Means of Block Low-Rank Representations , 2015, SIAM J. Sci. Comput..

[38] Julien Langou,et al. A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures , 2007, Parallel Comput..

[39] David E. Keyes,et al. Hierarchical Matrix Approximations of Hessians Arising in Inverse Problems Governed by PDEs , 2020, SIAM J. Sci. Comput..