Tile Low-Rank GEMM Using Batched Operations on GPUs
[1] Anima Anandkumar,et al. Tensor Contractions with Extended BLAS Kernels on CPU and GPU , 2016, 2016 IEEE 23rd International Conference on High Performance Computing (HiPC).
[2] Jack J. Dongarra,et al. Performance, Design, and Autotuning of Batched GEMM for GPUs , 2016, ISC.
[3] Steffen Börm,et al. Data-sparse Approximation by Adaptive ℋ2-Matrices , 2002, Computing.
[4] S. Börm. Efficient Numerical Methods for Non-local Operators , 2010 .
[5] Nathan Halko,et al. Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions , 2009, SIAM Rev..
[6] W. Hackbusch. A Sparse Matrix Arithmetic Based on ℋ-Matrices. Part I: Introduction to ℋ-Matrices , 1999, Computing.
[7] Martín Abadi,et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems , 2016, ArXiv.
[8] David E. Keyes,et al. Performance optimization of Sparse Matrix‐Vector Multiplication for multi‐component PDE‐based applications using GPUs , 2016, Concurr. Comput. Pract. Exp..
[9] David E. Keyes,et al. Tile Low Rank Cholesky Factorization for Climate/Weather Modeling Applications on Manycore Architectures , 2017, ISC.
[10] David E. Keyes,et al. Accelerated Cyclic Reduction: A distributed-memory fast solver for structured linear systems , 2017, Parallel Comput..
[11] Eric Darve,et al. An O(N log N) Fast Direct Solver for Partial Hierarchically Semi-Separable Matrices , 2013 .
[12] Pieter Ghysels,et al. A Distributed-Memory Package for Dense Hierarchically Semi-Separable Matrix Computations Using Randomization , 2015, ACM Trans. Math. Softw..
[13] Jean-Yves L'Excellent,et al. Improving Multifrontal Methods by Means of Block Low-Rank Representations , 2015, SIAM J. Sci. Comput..
[14] W. Hackbusch,et al. On H2-Matrices , 2000 .
[15] David E. Keyes,et al. Real-Time Massively Distributed Multi-object Adaptive Optics Simulations for the European Extremely Large Telescope , 2018, 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS).
[16] David E. Keyes,et al. Batched QR and SVD Algorithms on GPUs with Applications in Hierarchical Matrix Compression , 2017, Parallel Comput..
[17] Simon D. Hammond,et al. Designing Vector-Friendly Compact BLAS and LAPACK Kernels , 2017, SC17: International Conference for High Performance Computing, Networking, Storage and Analysis.
[18] Jack J. Dongarra,et al. An extended set of FORTRAN basic linear algebra subprograms , 1988, TOMS.
[19] Hatem Ltaief,et al. Batched Triangular Dense Linear Algebra Kernels for Very Small Matrix Sizes on GPUs , 2019, ACM Trans. Math. Softw..
[20] Jack J. Dongarra,et al. High-Performance Tensor Contractions for GPUs , 2016, ICCS.
[21] Alexander Heinecke,et al. LIBXSMM: Accelerating Small Matrix Multiplications by Runtime Code Generation , 2016, SC16: International Conference for High Performance Computing, Networking, Storage and Analysis.
[22] Marc G. Genton,et al. Correlation Models for Temperature Fields , 2011 .
[23] Ronald Kriemann,et al. ℋ-LU factorization on many-core systems , 2013, Comput. Vis. Sci..
[24] Wolfgang Hackbusch,et al. Construction and Arithmetics of H-Matrices , 2003, Computing.
[25] Jack Dongarra,et al. Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects , 2009 .
[26] David E. Keyes,et al. Exploiting Data Sparsity for Large-Scale Matrix Computations , 2018, Euro-Par.
[27] E. Tyrtyshnikov. Mosaic-Skeleton approximations , 1996 .