Column-Segmented Sparse Matrix-Matrix Multiplication on Multicore CPUs

Sparse general matrix-matrix multiplication (SpGEMM) is one of the most fundamental yet challenging sparse computation kernels. Because of its irregular computation pattern, SpGEMM frequently becomes the performance bottleneck in scientific applications. Many state-of-the-art approaches rely on either dense or sparse accumulators as a critical component for merging matrix rows. Dense accumulators are efficient for small matrices but are infeasible for large or highly sparse matrices because of their high memory use and low cache efficiency. In this work, we segment the columns of the second input matrix and propose a new SpGEMM algorithm that combines a new sparse high-level overview of the matrix with small, fast dense accumulators that fit in cache. Our approach thereby brings the benefits of dense accumulators to both large and highly sparse matrices. Our extensive experimental evaluation, carried out on three hardware platforms and on hundreds of sparse matrices from a variety of domains, shows that our algorithm outperforms state-of-the-art SpGEMM implementations.
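
To make the column-segmentation idea concrete, the following is a minimal single-threaded sketch in Python. It is an illustration under stated assumptions, not the implementation evaluated in this work: the function name spgemm_column_segmented, the list-of-(column, value) row format, and the seg_width parameter are all hypothetical, and the sketch omits the sparse high-level overview and any multicore parallelism. It only shows how splitting the columns of the second matrix lets a row-wise multiplication reuse a dense accumulator whose length is the segment width rather than the full column count.

```python
def spgemm_column_segmented(a_rows, b_rows, n_cols, seg_width):
    """Compute C = A * B row by row, one column segment of B at a time.

    a_rows, b_rows: lists of sparse rows; each row is a list of
    (column, value) pairs sorted by column. n_cols is the number of
    columns of B. seg_width is a hypothetical tuning parameter.
    """
    n_segs = (n_cols + seg_width - 1) // seg_width

    # Pre-split every row of B by column segment: b_seg[s][k] holds the
    # entries of row k of B whose columns fall in segment s.
    b_seg = [[[] for _ in b_rows] for _ in range(n_segs)]
    for k, row in enumerate(b_rows):
        for j, v in row:
            b_seg[j // seg_width][k].append((j, v))

    c_rows = [[] for _ in a_rows]
    # One small dense accumulator, reused for every row and segment.
    acc = [0.0] * seg_width
    used = [False] * seg_width

    for s in range(n_segs):
        base = s * seg_width
        for i, a_row in enumerate(a_rows):
            touched = []  # accumulator slots written for this row
            for k, a_val in a_row:
                for j, b_val in b_seg[s][k]:
                    local = j - base
                    if not used[local]:
                        used[local] = True
                        touched.append(local)
                    acc[local] += a_val * b_val
            # Emit this segment's entries in column order, then reset
            # only the slots that were touched.
            touched.sort()
            for local in touched:
                c_rows[i].append((base + local, acc[local]))
                acc[local] = 0.0
                used[local] = False
    return c_rows


# Small example: A is 2x3, B is 3x4.
A = [[(0, 1.0), (2, 2.0)], [(1, 3.0)]]
B = [[(0, 1.0), (3, 2.0)], [(1, 4.0)], [(0, 5.0), (2, 6.0)]]
C = spgemm_column_segmented(A, B, n_cols=4, seg_width=2)
```

Because segments are processed in increasing column order, each output row comes out sorted by column without a final merge. In this sketch the choice of seg_width affects only the accumulator size, not the result; choosing it so that the accumulator fits in cache is the motivation described above.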