Avoiding Communication in Successive Band Reduction

The running time of an algorithm depends on both arithmetic and communication (i.e., data movement) costs, and the relative cost of communication is growing over time. In this work, we present sequential and distributed-memory parallel algorithms for tridiagonalizing full symmetric and symmetric band matrices that asymptotically reduce communication compared to previous approaches. The tridiagonalization of a symmetric band matrix is a key kernel in solving the symmetric eigenvalue problem for both full and band matrices. To preserve the band structure, tridiagonalization routines use annihilate-and-chase procedures, which have previously suffered from poor data locality and high parallel latency cost. We improve both asymptotically by reorganizing the computation. We also propose new communication-efficient algorithms for reducing a full symmetric matrix to band form. We consider both the case of computing eigenvalues only and the case of computing eigenvalues and all eigenvectors.
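
To make the annihilate-and-chase idea concrete, the following is a minimal sketch of a classical rotation-based band tridiagonalization. It is not the communication-avoiding algorithm proposed here, only an illustration of the baseline access pattern. The function names `zero_entry` and `band_to_tridiagonal` and the parameter `bw` (the half-bandwidth) are hypothetical, and the matrix is held in dense list-of-lists storage purely for readability.

```python
import math

def zero_entry(T, row, col):
    """Zero T[row][col] and its mirror by rotating the (row-1, row) plane."""
    n = len(T)
    p, q = row - 1, row
    a, b = T[p][col], T[q][col]
    r = math.hypot(a, b)
    c, s = a / r, b / r
    for k in range(n):  # mix rows p and q
        tp, tq = T[p][k], T[q][k]
        T[p][k], T[q][k] = c * tp + s * tq, -s * tp + c * tq
    for k in range(n):  # mix columns p and q (keeps T symmetric)
        tp, tq = T[k][p], T[k][q]
        T[k][p], T[k][q] = c * tp + s * tq, -s * tp + c * tq
    T[q][col] = T[col][q] = 0.0  # clear rounding residue

def band_to_tridiagonal(A, bw):
    """Reduce a symmetric matrix with half-bandwidth bw to tridiagonal form."""
    T = [[float(x) for x in line] for line in A]
    n = len(T)
    for j in range(n - 2):
        # Annihilate column j from the edge of the band inward.
        for i in range(min(j + bw, n - 1), j + 1, -1):
            row, col = i, j
            # Each rotation creates a bulge bw rows further down;
            # chase it until it falls off the end of the matrix.
            while row < n and T[row][col] != 0.0:
                zero_entry(T, row, col)
                row, col = row + bw, row - 1
    return T

A = [[4, 1, 2, 0, 0],
     [1, 3, 1, 2, 0],
     [2, 1, 5, 1, 2],
     [0, 2, 1, 3, 1],
     [0, 0, 2, 1, 4]]
T = band_to_tridiagonal(A, 2)
```

Every rotation touches only two rows and two columns, and each chase walks down the band in steps of `bw`. This is the kind of access pattern that gives the classical procedure little data reuse.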
