Fast synchronization‐free algorithms for parallel sparse triangular solves with multiple right‐hand sides
Brian Vinter | Weifeng Liu | Ang Li | Iain S. Duff | Jonathan D. Hogg
[1] Wu-chun Feng, et al. Design and Evaluation of Scalable Concurrent Queues for Many-Core Architectures, 2015, ICPE.
[2] Henk Corporaal, et al. Locality-Aware CTA Clustering for Modern GPUs, 2017, ASPLOS.
[3] Milos Prvulovic, et al. MiSAR: Minimalistic synchronization accelerator with resource overflow management, 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).
[4] Edmond Chow, et al. Domain Overlap for Iterative Sparse Triangular Solves on GPUs, 2016, Software for Exascale Computing.
[5] Henk Corporaal, et al. Fine-Grained Synchronizations and Dataflow Programming on GPUs, 2015, ICS.
[6] David P. Anderson. Preserving hybrid objects, 2016, Commun. ACM.
[7] Brian Vinter, et al. Speculative segmented sum for sparse matrix-vector multiplication on heterogeneous processors, 2015, Parallel Comput.
[8] Wu-chun Feng, et al. Inter-block GPU communication via fast barrier synchronization, 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).
[9] Padma Raghavan, et al. Adapting Sparse Triangular Solution to GPUs, 2012, 2012 41st International Conference on Parallel Processing Workshops.
[10] Joel H. Saltz, et al. Aggregation Methods for Solving Sparse Triangular Systems on Multiprocessors, 1990, SIAM J. Sci. Comput.
[11] Brian Vinter, et al. CSR5: An Efficient Storage Format for Cross-Platform Sparse Matrix-Vector Multiplication, 2015, ICS.
[12] Weifeng Liu, et al. Fast segmented sort on GPUs, 2017, ICS.
[13] Jan Mayer, et al. Parallel algorithms for solving linear systems with sparse triangular matrices, 2009, Computing.
[14] Shengen Yan, et al. A Cross-Platform SpMV Framework on Many-Core Architectures, 2016, ACM Trans. Archit. Code Optim.
[15] Ninghui Sun, et al. SMAT: an input adaptive auto-tuner for sparse matrix-vector multiplication, 2013, PLDI.
[16] Brian Vinter, et al. A Synchronization-Free Algorithm for Parallel Sparse Triangular Solves, 2016, Euro-Par.
[17] Jonathan D. Hogg. A Fast Dense Triangular Solve in CUDA, 2013, SIAM J. Sci. Comput.
[18] Weifeng Liu, et al. Parallel and Scalable Sparse Basic Linear Algebra Subprograms, 2016.
[19] Yousef Saad, et al. GPU-accelerated preconditioned iterative linear solvers, 2013, The Journal of Supercomputing.
[20] Edmond Chow, et al. Batched Generation of Incomplete Sparse Approximate Inverses on GPUs, 2016, 2016 7th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA).
[21] Hao Wang, et al. Exploring and Analyzing the Real Impact of Modern On-Package Memory on HPC Scientific Kernels, 2017, SC17: International Conference for High Performance Computing, Networking, Storage and Analysis.
[22] Pradeep Dubey, et al. Sparsifying Synchronization for High-Performance Shared-Memory Sparse Triangular Solver, 2014, ISC.
[23] Kevin Skadron, et al. Dymaxion: Optimizing memory access patterns for heterogeneous systems, 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[24] Edmond Chow, et al. Iterative Sparse Triangular Solves for Preconditioning, 2015, Euro-Par.
[25] John K. Reid, et al. The design of MA48: a code for the direct solution of sparse unsymmetric linear systems of equations, 1996, TOMS.
[26] Henk Corporaal, et al. Adaptive and transparent cache bypassing for GPUs, 2015, SC15: International Conference for High Performance Computing, Networking, Storage and Analysis.
[27] Erik G. Boman, et al. Factors Impacting Performance of Multithreaded Sparse Triangular Solve, 2010, VECPAR.
[28] Jack J. Dongarra, et al. Incomplete Sparse Approximate Inverses for Parallel Preconditioning, 2018, Parallel Comput.
[29] Åke Björck. Numerical Methods for Least Squares Problems, 1996.
[30] Maxim Naumov. Parallel Solution of Sparse Triangular Linear Systems in the Preconditioned Iterative Methods on the GPU, 2011.
[31] Yousef Saad, et al. Solving Sparse Triangular Linear Systems on Parallel Computers, 1989, Int. J. High Speed Comput.
[32] Stefanos Kaxiras, et al. Callback: Efficient synchronization without invalidation with a directory just for spin-waiting, 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).
[33] Weimin Zheng, et al. A Fast Tridiagonal Solver for Intel MIC Architecture, 2016, 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS).
[34] Yves Robert, et al. STS-k: a multilevel sparse triangular solution scheme for NUMA multicores, 2015, SC15: International Conference for High Performance Computing, Networking, Storage and Analysis.
[35] Iain S. Duff, et al. An overview of the sparse basic linear algebra subprograms: The new standard from the BLAS technical forum, 2002, TOMS.
[36] Timothy A. Davis, et al. Direct methods for sparse linear systems, 2006, Fundamentals of algorithms.
[37] I. Duff, et al. Direct Methods for Sparse Matrices, 1987.
[38] Eric C. Kerrigan, et al. Balancing Locality and Concurrency: Solving Sparse Triangular Systems on GPUs, 2016, 2016 IEEE 23rd International Conference on High Performance Computing (HiPC).
[39] Weifeng Liu, et al. Parallel Transposition of Sparse Data Structures, 2016, ICS.
[40] Laxmikant V. Kalé, et al. Structure-adaptive parallel solution of sparse triangular linear systems, 2014, Parallel Comput.
[41] Hong Zhang, et al. Sparse triangular solves for ILU revisited: data layout crucial to better performance, 2011, Int. J. High Perform. Comput. Appl.
[42] Yousef Saad, et al. Iterative methods for sparse linear systems, 2003.
[43] Timothy A. Davis, et al. The University of Florida sparse matrix collection, 2011, TOMS.
[44] Adam Morrison. Scaling Synchronization in Multicore Programs, 2016, ACM Queue.
[45] Edmond Chow, et al. Fine-Grained Parallel Incomplete LU Factorization, 2015, SIAM J. Sci. Comput.
[46] Henk Corporaal, et al. Supplementary Materials to Adaptive and Transparent Cache Bypassing for GPUs, 2015.
[47] Brian Vinter, et al. A framework for general sparse matrix-matrix multiplication on GPUs and heterogeneous processors, 2015, J. Parallel Distributed Comput.
[48] Shengen Yan, et al. StreamScan: fast scan algorithms for GPUs without global barrier synchronization, 2013, PPoPP '13.