Fast synchronization‐free algorithms for parallel sparse triangular solves with multiple right‐hand sides

The sparse triangular solve kernels, SpTRSV and SpTRSM, are important building blocks for a number of numerical linear algebra routines. Parallelizing SpTRSV and SpTRSM on today's manycore platforms, such as GPUs, is not an easy task, since computing a component of the solution may depend on previously computed components, which enforces a degree of sequential processing. As a consequence, most existing work introduces a preprocessing stage that partitions the components into level-sets or colour-sets, so that components within a set are independent and can be processed simultaneously during the subsequent solution stage. However, this class of methods incurs a long preprocessing time as well as significant runtime synchronization overhead between the sets. To address this, we propose in this paper novel approaches for SpTRSV and SpTRSM in which the ordering between components is naturally enforced within the solution stage itself. In this way, the preprocessing cost is greatly reduced, and the synchronizations between sets are completely eliminated. To further exploit data-parallelism, we also develop an adaptive scheme for efficiently processing multiple right-hand sides in SpTRSM. A comparison with a state-of-the-art library supplied by the GPU vendor, using 20 sparse matrices on the latest GPU device, shows that the proposed approach achieves an average speedup of more than 2x for SpTRSV and up to an order of magnitude speedup for SpTRSM. In addition, our method is up to two orders of magnitude faster in the preprocessing stage than existing SpTRSV and SpTRSM methods.

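For intuition, below is a minimal CUDA sketch of how a synchronization-free lower-triangular solve in this spirit can be organized. It is a reconstruction under stated assumptions, not the authors' implementation: the CSC layout with the diagonal entry stored first in each column, the one-thread-per-component mapping, and the names `sptrsv_syncfree_csc`, `left_sum`, `done_cnt`, and `in_degree` are all hypothetical. Each component busy-waits on an atomic counter until every component it depends on has pushed its contribution, computes its own value, and then forwards contributions to its dependents, so the ordering between components is enforced inside the solve rather than by barriers between level-sets.

```cuda
// Illustrative sketch only: a synchronization-free solve of a sparse lower-
// triangular system L x = b, with L stored in CSC and the diagonal entry
// first in each column. One thread per component for readability; a
// production kernel would typically assign a warp per component.
// Assumes compute capability >= 6.0 (double-precision atomicAdd) and that a
// spinning thread's producer is already resident or scheduled earlier, so
// forward progress is possible.
#include <cuda_runtime.h>

__global__ void sptrsv_syncfree_csc(const int    *col_ptr,   // column pointers, size n+1
                                    const int    *row_idx,   // row indices, diagonal first per column
                                    const double *val,       // nonzero values of L
                                    const double *b,         // right-hand side
                                    double       *x,         // solution vector (output)
                                    double       *left_sum,  // accumulated contributions, zero-initialized
                                    int          *done_cnt,  // finished-dependency counters, zero-initialized
                                    const int    *in_degree, // off-diagonal nonzeros in each row of L
                                    int           n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // solution component handled by this thread
    if (i >= n) return;

    // Spin until every component that x[i] depends on has pushed its
    // contribution. The dependency ordering is enforced right here, inside
    // the solve, so no level-set construction or inter-set barrier is needed.
    while (atomicAdd(&done_cnt[i], 0) != in_degree[i]) { /* busy-wait */ }

    // Read the accumulated partial sum through an atomic so the producers'
    // updates are visible, then compute x[i] using the diagonal entry.
    double sum = atomicAdd(&left_sum[i], 0.0);
    double xi  = (b[i] - sum) / val[col_ptr[i]];
    x[i] = xi;

    // Forward x[i]'s contribution to every dependent row in column i, then
    // signal each of them that one more of their dependencies is resolved.
    for (int p = col_ptr[i] + 1; p < col_ptr[i + 1]; ++p) {
        int k = row_idx[p];
        atomicAdd(&left_sum[k], val[p] * xi);
        __threadfence();                 // make the contribution visible first
        atomicAdd(&done_cnt[k], 1);
    }
}

// Hypothetical launch: sptrsv_syncfree_csc<<<(n + 255) / 256, 256>>>(...);
```

Note that the only preprocessing such a scheme needs is the per-row dependency count (`in_degree` in the sketch), which is far cheaper to compute than level-sets or colour-sets; this is consistent with the reduced preprocessing cost described above.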