Paper Information - A Communication-Avoiding Parallel Algorithm for the Symmetric Eigenvalue Problem

A Communication-Avoiding Parallel Algorithm for the Symmetric Eigenvalue Problem

Many large-scale scientific computations require eigenvalue solvers in a scaling regime where efficiency is limited by data movement. We introduce a parallel algorithm for computing the eigenvalues of a dense symmetric matrix that performs asymptotically less communication than previously known approaches. We analyze the algorithm in the Bulk Synchronous Parallel (BSP) model, with additional consideration for communication between local memory and cache. Given sufficient memory to store $c$ copies of the symmetric matrix, our algorithm requires a factor of $\Theta(\sqrt{c})$ less interprocessor communication than previously known algorithms, for any $c \leq p^{1/3}$ when using $p$ processors. The algorithm first reduces the dense symmetric matrix to a banded matrix with the same eigenvalues. It then successively reduces this banded matrix to $O(\log p)$ progressively thinner banded matrices. Two new parallel algorithms achieve lower communication costs for the full-to-band and band-to-band reductions, and both leverage a novel QR factorization algorithm for rectangular matrices.
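To make the full-to-band step concrete, the following is a minimal sequential sketch in NumPy. It is not the parallel, communication-avoiding algorithm of the paper. It only illustrates the underlying idea: QR factorizations of rectangular panels, applied as two-sided orthogonal updates, produce a banded matrix with the same eigenvalues. The function name `full_to_band` and the bandwidth parameter `b` are hypothetical names chosen for this illustration.

```python
import numpy as np

def full_to_band(A, b):
    """Reduce a dense symmetric matrix to bandwidth b (illustrative sketch).

    Assumption: a plain sequential panel-QR scheme, not the paper's
    parallel algorithm. Each step applies an orthogonal similarity
    transform, so the eigenvalues are unchanged.
    """
    B = np.array(A, dtype=float, copy=True)
    n = B.shape[0]
    # Process panels of b columns; stop once fewer than two rows lie below the band.
    for k in range(0, n - b - 1, b):
        r = k + b  # index of the first row below the band for this panel
        # QR of the rectangular panel; mode="complete" returns a square Q.
        Q, _ = np.linalg.qr(B[r:, k:k + b], mode="complete")
        # Two-sided update: Q^T from the left, Q from the right.
        B[r:, :] = Q.T @ B[r:, :]
        B[:, r:] = B[:, r:] @ Q
    return B

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    M = rng.standard_normal((12, 12))
    A = (M + M.T) / 2              # random symmetric test matrix
    B = full_to_band(A, 3)
    # Entries more than 3 diagonals away from the main diagonal are numerically zero.
    print(np.allclose(np.tril(B, -4), 0.0))
    # The spectrum is preserved up to rounding error.
    print(np.allclose(np.linalg.eigvalsh(A), np.linalg.eigvalsh(B)))
```

Calling the same function again on the result with a smaller `b` yields a thinner band, which loosely mirrors the successive reduction described above. A practical band-to-band algorithm would exploit the existing band structure rather than treat the matrix as dense.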
