Parallel Band Two-Sided Matrix Bidiagonalization for Multicore Architectures

The objective of this paper is to extend, in the context of multicore architectures, the concepts of algorithms-by-tiles [Buttari et al., 2007] for the Cholesky, LU, and QR factorizations to the family of two-sided factorizations. In particular, the bidiagonal reduction of a general, dense matrix is very often used as a pre-processing step for calculating the singular value decomposition. Furthermore, in the last Top500 list from June 2008, 98% of the fastest parallel systems in the world were based on multicores. The manycore trend has increasingly exacerbated the problem, and it becomes critical to efficiently integrate existing or new numerical linear algebra algorithms suitable for such hardware. By exploiting the concept of algorithms-by-tiles in the multicore environment (i.e., a high level of parallelism with fine granularity and a high-performance data representation combined with dynamic, data-driven execution), the band bidiagonal reduction presented here achieves 94 Gflop/s on a 12000 × 12000 matrix with 16 Intel Tigerton 2.4 GHz processors.
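To illustrate why bidiagonal reduction serves as a pre-processing step for the singular value decomposition, the following is a minimal sketch of the classical unblocked Householder bidiagonalization in NumPy. It is not the tiled band algorithm of this paper; the function names `householder` and `bidiagonalize` are hypothetical, and the sketch assumes a real matrix with at least as many rows as columns. Because only orthogonal reflections are applied from the left and the right, the bidiagonal result has the same singular values as the input.

```python
import numpy as np


def householder(x):
    """Return a unit vector v such that (I - 2 v v^T) x is a multiple of e1."""
    v = x.astype(float).copy()
    norm_x = np.linalg.norm(x)
    if norm_x == 0.0:
        return v  # zero input: v = 0 makes the reflection the identity
    # Add the norm with the sign of x[0] to avoid cancellation.
    v[0] += np.copysign(norm_x, x[0])
    return v / np.linalg.norm(v)


def bidiagonalize(a):
    """Reduce an m x n matrix (m >= n) to upper bidiagonal form.

    Only the bidiagonal matrix B is returned; the orthogonal factors
    are applied in place and not accumulated.
    """
    b = np.array(a, dtype=float)
    m, n = b.shape
    for k in range(n):
        # Left reflector: annihilate column k below the diagonal.
        v = householder(b[k:, k])
        b[k:, k:] -= 2.0 * np.outer(v, v @ b[k:, k:])
        if k < n - 2:
            # Right reflector: annihilate row k beyond the superdiagonal.
            w = householder(b[k, k + 1:])
            b[k:, k + 1:] -= 2.0 * np.outer(b[k:, k + 1:] @ w, w)
    return b


# Small demonstration on a random matrix (sizes chosen arbitrarily).
rng = np.random.default_rng(0)
a = rng.standard_normal((6, 4))
b = bidiagonalize(a)
print(np.linalg.svd(a, compute_uv=False))
print(np.linalg.svd(b, compute_uv=False))
```

The two printed vectors of singular values should agree to rounding error. The unblocked form shown here is dominated by matrix-vector operations, which is one reason blocked and tiled formulations are of interest on multicore hardware.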
