Conflict-free symmetric sparse matrix-vector multiplication on multicore architectures

Exploiting numeric symmetry in sparse matrices to reduce their memory footprint is an attractive way to optimize the memory-bound Sparse Matrix-Vector Multiplication (SpMV) kernel. Although storing only the upper or lower triangular part of the matrix is very beneficial for serial computation, in a parallel execution it introduces race conditions in the updates to the output vector. Previous work has suggested using local, per-thread vectors to circumvent this problem, which introduces a work-inefficient reduction step that limits the scalability of SpMV. In this paper, we address this issue with Conflict-Free Symmetric (CFS) SpMV, an optimization strategy that organizes the parallel computation into phases of conflict-free execution. We identify such phases through graph coloring and propose heuristics that improve the coloring quality for SpMV in terms of load balancing and locality of accesses to the input and output vectors. We evaluate our approach on two multicore shared-memory systems and demonstrate improved performance over the state of the art.
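To make the idea of conflict-free phases concrete, the following is a minimal sketch, not the paper's implementation. It assumes the lower triangle is stored in CSR form (arrays `indptr`, `indices`, `data`), treats each row as one unit of work, and uses a plain greedy coloring; the load-balancing and locality heuristics mentioned above are not modeled. The function names `write_set`, `color_rows`, and `symmetric_spmv_by_color` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def write_set(i, indptr, indices):
    """Entries of y updated when stored row i of the lower triangle is processed."""
    return {i} | {indices[k] for k in range(indptr[i], indptr[i + 1])}


def color_rows(n, indptr, indices):
    """Greedy coloring: two rows that update a common entry of y get different colors."""
    writers = defaultdict(list)  # output index -> rows that write to it
    for i in range(n):
        for j in write_set(i, indptr, indices):
            writers[j].append(i)

    color = [-1] * n
    for i in range(n):
        # Collect colors already taken by rows that conflict with row i.
        used = set()
        for j in write_set(i, indptr, indices):
            for r in writers[j]:
                if color[r] >= 0:
                    used.add(color[r])
        c = 0
        while c in used:
            c += 1
        color[i] = c
    return color


def symmetric_spmv_by_color(n, indptr, indices, data, x):
    """Compute y = A x from the lower triangle of symmetric A, one color phase at a time."""
    color = color_rows(n, indptr, indices)
    phases = defaultdict(list)
    for i, c in enumerate(color):
        phases[c].append(i)

    y = [0.0] * n
    for c in sorted(phases):
        # Rows in one phase update disjoint entries of y, so this inner loop
        # could run in parallel without locks or per-thread vectors.
        # It is executed serially here to keep the sketch self-contained.
        for i in phases[c]:
            for k in range(indptr[i], indptr[i + 1]):
                j, a = indices[k], data[k]
                y[i] += a * x[j]
                if j != i:
                    y[j] += a * x[i]  # contribution of the mirrored entry a_ji
    return y, color
```

In this sketch a barrier between phases would replace the reduction step needed by the per-thread-vector approach. The number of phases equals the number of colors, which is why coloring quality matters for parallel efficiency.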
