Performance Optimizations and Bounds for Sparse Symmetric Matrix-Multiple Vector Multiply

We present performance optimizations and detailed performance analyses of sparse matrix-vector multiply (SpMV) and its generalization to multiple vectors (SpMM) when the matrix is symmetric. In addition to the traditional storage savings from symmetry, we show that considerable performance benefits are possible on modern architectures, where the cost of memory access dominates the cost of other operations. We analyze the effects of (1) symmetric storage, (2) register-level blocked matrix storage, and (3) register-level vector blocking for multiple vectors, applied both individually and in conjunction. Compared to the best register-blocked implementations that ignore symmetry, we show 2.1× speedups for SpMV and 2.6× speedups for SpMM. Compared to the most naïve implementation, in which none of the three optimizations is applied, our best implementations are up to 9.9× faster for a dense matrix in sparse format and up to 7.3× faster for a true sparse matrix. We evaluate our implementations with respect to upper bounds on their absolute performance in Mflop/s. Our performance model, which extends our prior bounds for SpMV in the non-symmetric case, bounds performance by considering only the cost of memory operations and using lower bounds on cache misses. On four different platforms and a suite of twelve symmetric matrices spanning a variety of applications, we find our implementations are within 60% of the upper bounds on average. This fraction is smaller than what we have previously observed for non-symmetric SpMV, suggesting two possible avenues for future work: (1) additional refinements to the performance bounds that explicitly model low-level code generation (e.g., register pressure or instruction scheduling/selection/bandwidth issues), and (2) a new opportunity to apply automated low-level tuning techniques, in the spirit of the ATLAS/PHiPAC systems, to symmetric sparse kernels.