Scheduling Two-Sided Transformations Using Algorithms-by-Tiles on Multicore Architectures (LAPACK Working Note #214)

The objective of this paper is to describe, in the context of multicore architectures, different scheduler implementations for two-sided linear algebra transformations, in particular the Hessenberg and bidiagonal reductions, which are the first steps toward solving the standard eigenvalue problem and computing the singular value decomposition, respectively. State-of-the-art dense linear algebra software, such as the LAPACK and ScaLAPACK libraries, suffers performance losses on multicore processors because it cannot fully exploit thread-level parallelism. At the same time, the coarse-grain dataflow model is gaining popularity as a paradigm for programming multicore architectures. By combining the concept of algorithms-by-tiles [Buttari et al., 2007] with efficient mechanisms for data-driven execution, these two-sided reductions achieve high performance. The main drawback of the algorithms-by-tiles approach for two-sided transformations is that the full reduction cannot be obtained in a single stage: the first stage produces a band matrix, and other methods must then be applied to further reduce the band matrix to the required condensed form.
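To make the data-driven execution model concrete, the sketch below (not taken from the paper) expresses one simplified sweep of a tile-based band reduction as OpenMP tasks with dependence clauses: each kernel touches one tile or a pair of tiles, and the runtime orders the tasks from the declared data dependences rather than from a fixed fork-join schedule. The tile size, the kernel names geqrt_tile, ormqr_tile, gelqt_tile, and ormlq_tile, and the update pattern are illustrative placeholders rather than the paper's kernels, and the trailing-submatrix updates of a real band reduction are omitted.

```c
/* Hypothetical sketch of data-driven tile scheduling with OpenMP task
 * dependences (OpenMP 4.0+). Kernel names are placeholders, not
 * PLASMA/LAPACK routines; the bodies are stubs. */
#include <stdlib.h>

#define NT 4      /* number of tile rows/columns */
#define NB 128    /* tile size */

typedef struct { double a[NB * NB]; } tile_t;

/* Stubs: a real implementation would call optimized sequential kernels. */
static void geqrt_tile(tile_t *akk)                  { (void)akk; }
static void ormqr_tile(const tile_t *v, tile_t *akj) { (void)v; (void)akj; }
static void gelqt_tile(tile_t *akk1)                 { (void)akk1; }
static void ormlq_tile(const tile_t *v, tile_t *aik) { (void)v; (void)aik; }

int main(void)
{
    tile_t *A = calloc((size_t)NT * NT, sizeof(tile_t));
    if (!A) return 1;
#define T(i, j) A[(i) * NT + (j)]

    #pragma omp parallel
    #pragma omp single
    {
        for (int k = 0; k < NT - 1; k++) {
            /* QR of the diagonal tile annihilates entries below the diagonal. */
            #pragma omp task depend(inout: T(k, k))
            geqrt_tile(&T(k, k));

            /* Apply the reflectors from the left to the rest of tile row k. */
            for (int j = k + 1; j < NT; j++) {
                #pragma omp task depend(in: T(k, k)) depend(inout: T(k, j))
                ormqr_tile(&T(k, k), &T(k, j));
            }

            /* LQ of the tile right of the diagonal: the "two-sided" step that
             * annihilates entries beyond the first tile superdiagonal. */
            #pragma omp task depend(inout: T(k, k + 1))
            gelqt_tile(&T(k, k + 1));

            /* Apply the reflectors from the right to the rest of tile column k+1. */
            for (int i = k + 1; i < NT; i++) {
                #pragma omp task depend(in: T(k, k + 1)) depend(inout: T(i, k + 1))
                ormlq_tile(&T(k, k + 1), &T(i, k + 1));
            }
        }
    } /* implicit barrier: all tasks have completed here */

    free(A);
    return 0;
}
```

With dependences declared this way, tasks from different sweeps may overlap as soon as their input tiles are ready, which is the property that lets a data-driven runtime avoid the global synchronization points of classical blocked algorithms.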

[1] John A. Gunnels, et al. Minimal Data Copy for Dense Linear Algebra Factorization, 2006, PARA.

[2] James Demmel, et al. ScaLAPACK: A Portable Linear Algebra Library for Distributed Memory Computers - Design Issues and Performance, 1995, PARA.

[3] Robert A. van de Geijn, et al. Updating an LU Factorization with Pivoting, 2008, TOMS.

[4] Jack J. Dongarra, et al. Implementing Linear Algebra Routines on Multi-core Processors with Pipelining and a Look Ahead, 2006, PARA.

[5] Jack J. Dongarra, et al. Solving Systems of Linear Equations on the CELL Processor Using Cholesky Factorization, 2008, IEEE Transactions on Parallel and Distributed Systems.

[6] Jack J. Dongarra, et al. Parallel tiled QR factorization for multicore architectures, 2008.

[7] Viktor K. Prasanna, et al. Tiling, Block Data Layout, and Memory Hierarchy Performance, 2003, IEEE Trans. Parallel Distributed Syst.

[8] Jack Dongarra, et al. LAPACK Users' Guide, 3rd ed., 1999.

[9] G. W. Stewart, et al. Matrix Algorithms: Volume 1, Basic Decompositions, 1998.

[10] Julien Langou, et al. Parallel tiled QR factorization for multicore architectures, 2007, Concurr. Comput. Pract. Exp.

[11] C. Van Loan, et al. A Storage-Efficient WY Representation for Products of Householder Transformations, 1989.

[12] Markus Hegland, et al. A Parallel Algorithm for the Reduction to Tridiagonal Form for Eigendecomposition, 1999, SIAM J. Sci. Comput.

[13] Jack Dongarra, et al. QR Factorization for the CELL Processor, 2008.

[14] C. Danforth, et al. Estimating and Correcting Global Weather Model Error, 2007.

[15] Erik Elmroth, et al. Recursive Blocked Algorithms and Hybrid Data Structures for Dense Matrix Library Software, 2004, SIAM Review, Vol. 46, No. 1, pp. 3–45.

[16] Ramesh C. Agarwal, et al. Vector and parallel algorithms for Cholesky factorization on IBM 3090, 1989, Proceedings of the 1989 ACM/IEEE Conference on Supercomputing (Supercomputing '89).

[17] Taher H. Haveliwala, et al. The Second Eigenvalue of the Google Matrix, 2003.

[18] Erik Elmroth, et al. High-Performance Library Software for QR Factorization, 2000, PARA.

[19] Jesús Labarta, et al. CellSs: Making it easier to program the Cell Broadband Engine processor, 2007, IBM J. Res. Dev.

[20] Philipp Birken, et al. Numerical Linear Algebra, 2011, Encyclopedia of Parallel Computing.

[21] Erik Elmroth, et al. Applying recursion to serial and parallel QR factorization leads to better performance, 2000, IBM J. Res. Dev.

[22] Gene H. Golub, et al. Matrix Computations (3rd ed.), 1996.

[23] Robert A. van de Geijn, et al. Scheduling of QR Factorization Algorithms on SMP and Multi-Core Architectures, 2008, 16th Euromicro Conference on Parallel, Distributed and Network-Based Processing (PDP 2008).

[24] Rosa M. Badia, et al. CellSs: a Programming Model for the Cell BE Architecture, 2006, ACM/IEEE SC 2006 Conference (SC'06).

[25] Fred G. Gustavson, et al. New Generalized Matrix Data Structures Lead to a Variety of High-Performance Algorithms, 2000, The Architecture of Scientific Software.

[26] Robert A. van de Geijn, et al. Parallel out-of-core computation and updating of the QR factorization, 2005, TOMS.

[27] E. L. Yip, et al. FORTRAN subroutines for out-of-core solutions of large complex linear systems, 1979.

[28] Jesse L. Barlow, et al. Block and Parallel Versions of One-Sided Bidiagonalization, 2007, SIAM J. Matrix Anal. Appl.

[29] Viktor K. Prasanna, et al. Analysis of memory hierarchy performance of block data layout, 2002, Proceedings International Conference on Parallel Processing.

[30] Julien Langou, et al. A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures, 2007, Parallel Comput.

[31] Jack Dongarra, et al. Multithreading for synchronization tolerance in matrix factorization, 2007.

[32] Z. Drmač, et al. A new stable bidiagonal reduction algorithm, 2005.

[33] Erik Elmroth, et al. New Serial and Parallel Recursive QR Factorization Algorithms for SMP Systems, 1998, PARA.

[34] S. P. Kumar, et al. Solving Linear Algebraic Equations on an MIMD Computer, 1983, JACM.

[35] Jack Dongarra, et al. Parallel Tiled QR Factorization for Multicore Architectures, LAPACK Working Note #190, 2007.