Communication Avoiding Symmetric Band Reduction

The running time of an algorithm depends on both arithmetic and communication (i.e., data movement) costs, and the relative costs of communication are growing over time. In this work, we present both theoretical and practical results for tridiagonalizing a symmetric band matrix: we present an algorithm that asymptotically reduces communication, and we show that it indeed performs well in practice. The tridiagonalization of a symmetric band matrix is a key kernel in solving the symmetric eigenvalue problem for both full and band matrices. In order to preserve sparsity, tridiagonalization routines use annihilate-and-chase procedures that previously have suffered from poor data locality. We improve data locality by reorganizing the computation, asymptotically reducing communication costs compared to existing algorithms. Our sequential implementation demonstrates that avoiding communication improves runtime even at the expense of extra arithmetic: we observe a 2× speedup over Intel MKL while doing 43% more floating-point operations. Our parallel implementation targets shared-memory multicore platforms. It uses pipelined parallelism and a static scheduler while retaining the locality properties of the sequential algorithm. Due to lightweight synchronization and effective data reuse, we see 9.5× scaling over our serial code and up to 6× speedup over the PLASMA library, comparing parallel performance on a ten-core processor.
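The annihilate-and-chase idea referenced above is easiest to see in the classical element-wise (Schwarz-style) formulation, which is the baseline that communication-avoiding reorganizations improve upon. Below is a minimal NumPy sketch of that baseline, not the paper's algorithm or any library's API: each out-of-band element is annihilated with a Givens rotation, and the resulting fill ("bulge") outside the band is chased off the bottom of the matrix. The names `band_tridiagonalize`, `_givens`, and `_sym_rotate` are illustrative, and dense storage is used for readability; practical codes work in packed band storage with blocked sweeps.

```python
import numpy as np

def _givens(a, b):
    """Return (c, s) such that [[c, s], [-s, c]] @ [a, b] = [r, 0]."""
    if b == 0.0:
        return 1.0, 0.0
    r = np.hypot(a, b)
    return a / r, b / r

def _sym_rotate(A, i, j, c, s):
    """In-place similarity transform A <- G A G^T for a Givens rotation G
    acting in the (i, j) plane; preserves symmetry and eigenvalues."""
    ri, rj = A[i, :].copy(), A[j, :].copy()
    A[i, :], A[j, :] = c * ri + s * rj, -s * ri + c * rj
    ci, cj = A[:, i].copy(), A[:, j].copy()
    A[:, i], A[:, j] = c * ci + s * cj, -s * ci + c * cj

def band_tridiagonalize(A, b):
    """Annihilate-and-chase reduction of a symmetric matrix A with
    bandwidth b to tridiagonal form (element-wise Schwarz-style sweep)."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    for d in range(b, 1, -1):            # peel off diagonals d = b, ..., 2
        for col in range(n - d):
            # Annihilate A[col+d, col] with a rotation in the
            # (col+d-1, col+d) plane; this spills a bulge outside the
            # band at (col+2d, col+d-1), which must be chased away.
            gc, gs = _givens(A[col + d - 1, col], A[col + d, col])
            _sym_rotate(A, col + d - 1, col + d, gc, gs)
            r = col + 2 * d              # row of the current bulge
            while r < n:
                cb = r - d - 1           # column of the current bulge
                gc, gs = _givens(A[r - 1, cb], A[r, cb])
                _sym_rotate(A, r - 1, r, gc, gs)
                r += d                   # each chase step recreates the bulge d rows down
    return A
```

A quick sanity check under the same assumptions: the result should be tridiagonal with the spectrum preserved.

```python
rng = np.random.default_rng(0)
n, b = 8, 3
B = rng.standard_normal((n, n))
A = np.triu(np.tril(B + B.T, b), -b)   # random symmetric band matrix
T = band_tridiagonalize(A, b)
assert np.allclose(np.triu(T, 2), 0.0, atol=1e-12)               # tridiagonal
assert np.allclose(np.linalg.eigvalsh(A), np.linalg.eigvalsh(T))  # spectrum preserved
```

Note how each chase streams the rotation down the entire remaining band before the next element is annihilated; this access pattern is the source of the poor data locality that the abstract describes, and it is what the reorganized, communication-avoiding computation addresses.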
