CapelliniSpTRSV: A Thread-Level Synchronization-Free Sparse Triangular Solve on GPUs

Sparse triangular solves (SpTRSVs) are widely used in linear algebra, and many GPU-based SpTRSV algorithms have been proposed. Synchronization-free SpTRSVs are currently the most popular, owing to their short preprocessing time and high performance. However, we observe that the performance of these algorithms on different matrices can vary by up to 845x. Our further studies show that when the average number of components per level is high and the average number of nonzero elements per row is low, these SpTRSVs exhibit extremely low performance. The reason is that they assign a GPU warp to each row of the sparse matrix, and such warp-level designs severely underutilize the GPU. To solve this problem, we propose CapelliniSpTRSV, a thread-level synchronization-free SpTRSV algorithm. CapelliniSpTRSV has three novel features. First, unlike previous studies, it needs no preprocessing to compute levels. Second, it achieves high performance on matrices that previous SpTRSVs cannot handle efficiently. Third, its optimizations do not rely on a specific sparse matrix storage format; it performs very well on the most popular format, compressed sparse row (CSR), so users do not need to perform format conversion. We evaluate CapelliniSpTRSV with 245 matrices from the Florida Sparse Matrix Collection on three GPU platforms. Experiments show that our SpTRSV achieves 6.84 GFLOPS, a 4.97x speedup over the state-of-the-art synchronization-free SpTRSV algorithm and a 4.74x speedup over the SpTRSV in cuSPARSE. CapelliniSpTRSV is open-sourced at https://github.com/JiyaSu/CapelliniSpTRSV.
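To make the per-row work concrete, the sketch below shows, in plain Python, the forward substitution each GPU thread would perform for its assigned row of a lower-triangular matrix stored in CSR. This is a hedged, sequential simplification for illustration only: the actual CapelliniSpTRSV kernel runs one thread per row and spins on per-entry ready flags instead of visiting rows in order, and the function name and argument layout here are our own, not from the paper's code.

```python
def sptrsv_csr_lower(row_ptr, col_idx, vals, b):
    """Solve L x = b for a lower-triangular matrix L in CSR format.

    Sequential sketch of a thread-level SpTRSV: on the GPU, each row i
    would be handled by one thread, which waits (synchronization-free,
    via flags) until the x[j] values it depends on are produced.
    """
    n = len(b)
    x = [0.0] * n
    for i in range(n):              # one GPU thread per row i
        s = b[i]
        diag = None
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]
            if j == i:
                diag = vals[k]      # diagonal entry of row i
            else:
                s -= vals[k] * x[j] # consume earlier rows' solutions
        x[i] = s / diag
    return x


# Example: L = [[2, 0], [1, 4]], b = [4, 6]  ->  x = [2, 1]
row_ptr = [0, 1, 3]
col_idx = [0, 0, 1]
vals = [2.0, 1.0, 4.0]
print(sptrsv_csr_lower(row_ptr, col_idx, vals, [4.0, 6.0]))
```

Because each thread handles a single row, short rows (few nonzeros per row) no longer waste the idle lanes that a warp-per-row design would leave unused, which is precisely the underutilization the abstract identifies.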
