Sparsifying Synchronization for High-Performance Shared-Memory Sparse Triangular Solver

The last decade has seen rapid growth of single-chip multiprocessors CMPs, which have been leveraging Moore's law to deliver high concurrency via increases in the number of cores and vector width. Modern CMPs execute from several hundreds to several thousands concurrent operations per second, while their memory subsystem delivers from tens to hundreds Giga-bytes per second bandwidth. Taking advantage of these parallel resources requires highly tuned parallel implementations of key computational kernels, which form the back-bone of modern HPC. Sparse triangular solver is one such kernel and is the focus of this paper. It is widely used in several types of sparse linear solvers, and it is commonly considered challenging to parallelize and scale even on a moderate number of cores. This challenge is due to the fact that triangular solver typically has limited task-level parallelism and relies on fine-grain synchronization to exploit this parallelism, compared to data-parallel operations such as sparse matrix-vector multiplication. This paper presents synchronization sparsification technique that significantly reduces the overhead of synchronization in sparse triangular solver and improves its scalability. We discover that a majority of task dependencies are redundant in task dependency graphs which are used to model the flow of computation in sparse triangular solver. We propose a fast and approximate sparsification algorithm, which eliminates more than 90% of these dependencies, substantially reducing synchronization overhead. As a result, on a 12-core Intel® Xeon® processor, our approach improves the performance of sparse triangular solver by 1.6x, compared to the conventional level-scheduling with barrier synchronization. This, in turn, leads to a 1.4x speedup in a pre-conditioned conjugate gradient solver.

[1]  Jack Dongarra,et al.  Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects , 2009 .

[2]  João Correia Lopes,et al.  High Performance Computing for Computational Science - VECPAR 2010 - 9th International conference, Berkeley, CA, USA, June 22-25, 2010, Revised Selected Papers , 2011, VECPAR.

[3]  E. L. Poole,et al.  Multicolor ICCG methods for vector computers , 1987 .

[4]  Anoop Gupta,et al.  Parallel ICCG on a hierarchical memory multiprocessor - Addressing the triangular solve bottleneck , 1990, Parallel Comput..

[5]  Hong Zhang,et al.  Sparse triangular solves for ILU revisited: data layout crucial to better performance , 2011, Int. J. High Perform. Comput. Appl..

[6]  J. Meijerink,et al.  An iterative solution method for linear systems of which the coefficient matrix is a symmetric -matrix , 1977 .

[7]  V. E. Henson,et al.  BoomerAMG: a parallel algebraic multigrid solver and preconditioner , 2002 .

[8]  Ronald L. Graham,et al.  Bounds on Multiprocessing Timing Anomalies , 1969, SIAM Journal of Applied Mathematics.

[9]  Debra Hensgen,et al.  Two algorithms for barrier synchronization , 1988, International Journal of Parallel Programming.

[10]  Pradeep Dubey,et al.  Fast and Efficient Graph Traversal Algorithm for CPUs: Maximizing Single-Node Efficiency , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.

[11]  Tinkara Toš,et al.  Graph Algorithms in the Language of Linear Algebra , 2012, Software, environments, tools.

[12]  Robert A. van de Geijn,et al.  Supermatrix out-of-order scheduling of matrix operations for SMP and multi-core architectures , 2007, SPAA '07.

[13]  Yousef Saad,et al.  Solving Sparse Triangular Linear Systems on Parallel Computers , 1989, Int. J. High Speed Comput..

[14]  Hiroshi Nakashima,et al.  Algebraic Block Multi-Color Ordering Method for Parallel Multi-Threaded Sparse Triangular Solver in ICCG Method , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.

[15]  M. Hestenes,et al.  Methods of conjugate gradients for solving linear systems , 1952 .

[16]  Victor Eijkhout,et al.  A Parallel Sparse Direct Solver via Hierarchical DAG Scheduling , 2014, ACM Trans. Math. Softw..

[17]  Joel H. Saltz,et al.  Run-time parallelization and scheduling of loops , 1989, SPAA '89.

[18]  William J. Dally,et al.  Buffer-space efficient and deadlock-free scheduling of stream applications on multi-core architectures , 2010, SPAA '10.

[19]  Sandia Report,et al.  Toward a New Metric for Ranking High Performance Computing Systems , 2013 .

[20]  Harry T. Hsu,et al.  An Algorithm for Finding a Minimal Equivalent Graph of a Digraph , 1975, JACM.

[21]  Joel H. Saltz,et al.  Aggregation Methods for Solving Sparse Triangular Systems on Multiprocessors , 1990, SIAM J. Sci. Comput..

[22]  Timothy A. Davis,et al.  The university of Florida sparse matrix collection , 2011, TOMS.

[23]  Jan Mayer,et al.  Parallel algorithms for solving linear systems with sparse triangular matrices , 2009, Computing.

[24]  Santa Clara,et al.  Parallel Solution of Sparse Triangular Linear Systems in the Preconditioned Iterative Methods on the GPU , 2011 .

[25]  Matthias S. Müller,et al.  Memory Performance and Cache Coherency Effects on an Intel Nehalem Multiprocessor System , 2009, 2009 18th International Conference on Parallel Architectures and Compilation Techniques.

[26]  T. C. Hu Parallel Sequencing and Assembly Line Problems , 1961 .

[27]  Erik G. Boman,et al.  Factors Impacting Performance of Multithreaded Sparse Triangular Solve , 2010, VECPAR.