Scheduling Two-Sided Transformations Using Algorithms-by-Tiles on Multicore Architectures (LAPACK Working Note #214)

The objective of this paper is to describe, in the context of multicore architectures, different scheduler implementations for two-sided linear algebra transformations, in particular the Hessenberg and bidiagonal reductions, which are the first steps toward solving the standard eigenvalue problem and computing the singular value decomposition, respectively. State-of-the-art dense linear algebra software, such as the LAPACK and ScaLAPACK libraries, suffers performance losses on multicore processors because it cannot fully exploit thread-level parallelism. At the same time, the coarse-grain dataflow model is gaining popularity as a paradigm for programming multicore architectures. By combining the concept of algorithms-by-tiles [Buttari et al., 2007] with efficient mechanisms for data-driven execution, these two-sided reductions achieve high performance. The main drawback of the algorithms-by-tiles approach for two-sided transformations is that the full reduction cannot be obtained in a single stage: the first stage produces a band matrix, and other methods must then be applied to further reduce the band matrix to the required condensed form.
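To make the data-driven execution model concrete, the sketch below (not taken from the paper) expresses one simplified sweep of a tile-based band reduction as OpenMP tasks with dependence clauses: each kernel touches one tile or a pair of tiles, and the runtime orders the tasks from the declared data dependences rather than from a fixed fork-join schedule. The tile size, the kernel names geqrt_tile, ormqr_tile, gelqt_tile, and ormlq_tile, and the update pattern are illustrative placeholders rather than the paper's kernels, and the trailing-submatrix updates of a real band reduction are omitted.

```c
/* Hypothetical sketch of data-driven tile scheduling with OpenMP task
 * dependences (OpenMP 4.0+). Kernel names are placeholders, not
 * PLASMA/LAPACK routines; the bodies are stubs. */
#include <stdlib.h>

#define NT 4      /* number of tile rows/columns */
#define NB 128    /* tile size */

typedef struct { double a[NB * NB]; } tile_t;

/* Stubs: a real implementation would call optimized sequential kernels. */
static void geqrt_tile(tile_t *akk)                  { (void)akk; }
static void ormqr_tile(const tile_t *v, tile_t *akj) { (void)v; (void)akj; }
static void gelqt_tile(tile_t *akk1)                 { (void)akk1; }
static void ormlq_tile(const tile_t *v, tile_t *aik) { (void)v; (void)aik; }

int main(void)
{
    tile_t *A = calloc((size_t)NT * NT, sizeof(tile_t));
    if (!A) return 1;
#define T(i, j) A[(i) * NT + (j)]

    #pragma omp parallel
    #pragma omp single
    {
        for (int k = 0; k < NT - 1; k++) {
            /* QR of the diagonal tile annihilates entries below the diagonal. */
            #pragma omp task depend(inout: T(k, k))
            geqrt_tile(&T(k, k));

            /* Apply the reflectors from the left to the rest of tile row k. */
            for (int j = k + 1; j < NT; j++) {
                #pragma omp task depend(in: T(k, k)) depend(inout: T(k, j))
                ormqr_tile(&T(k, k), &T(k, j));
            }

            /* LQ of the tile right of the diagonal: the "two-sided" step that
             * annihilates entries beyond the first tile superdiagonal. */
            #pragma omp task depend(inout: T(k, k + 1))
            gelqt_tile(&T(k, k + 1));

            /* Apply the reflectors from the right to the rest of tile column k+1. */
            for (int i = k + 1; i < NT; i++) {
                #pragma omp task depend(in: T(k, k + 1)) depend(inout: T(i, k + 1))
                ormlq_tile(&T(k, k + 1), &T(i, k + 1));
            }
        }
    } /* implicit barrier: all tasks have completed here */

    free(A);
    return 0;
}
```

With dependences declared this way, tasks from different sweeps may overlap as soon as their input tiles are ready, which is the property that lets a data-driven runtime avoid the global synchronization points of classical blocked algorithms.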

[1] John A. Gunnels, et al. Minimal Data Copy for Dense Linear Algebra Factorization, 2006, PARA.

[2] James Demmel, et al. ScaLAPACK: A Portable Linear Algebra Library for Distributed Memory Computers - Design Issues and Performance, 1995, PARA.

[3] Robert A. van de Geijn, et al. Updating an LU Factorization with Pivoting, 2008, TOMS.

[4] Jack J. Dongarra, et al. Implementing Linear Algebra Routines on Multi-core Processors with Pipelining and a Look Ahead, 2006, PARA.

[5] Jack J. Dongarra, et al. Solving Systems of Linear Equations on the CELL Processor Using Cholesky Factorization, 2008, IEEE Transactions on Parallel and Distributed Systems.

[6] Jack J. Dongarra, et al. Parallel tiled QR factorization for multicore architectures, 2008.

[7] Viktor K. Prasanna, et al. Tiling, Block Data Layout, and Memory Hierarchy Performance, 2003, IEEE Trans. Parallel Distributed Syst.

[8] Jack Dongarra, et al. LAPACK Users' Guide, 3rd ed., 1999.

[9] G. W. Stewart, et al. Matrix Algorithms: Volume 1, Basic Decompositions, 1998.

[10] Julien Langou, et al. Parallel tiled QR factorization for multicore architectures, 2007, Concurr. Comput. Pract. Exp.

[11] C. Van Loan, et al. A Storage-Efficient WY Representation for Products of Householder Transformations, 1989.

[12] Markus Hegland, et al. A Parallel Algorithm for the Reduction to Tridiagonal Form for Eigendecomposition, 1999, SIAM J. Sci. Comput.

[13] Jack Dongarra, et al. QR Factorization for the CELL Processor, 2008.

[14] C. Danforth, et al. Estimating and Correcting Global Weather Model Error, 2007.

[15] Erik Elmroth, et al. Recursive Blocked Algorithms and Hybrid Data Structures for Dense Matrix Library Software, 2004, SIAM Review, Vol. 46, No. 1, pp. 3–45.

[16] Ramesh C. Agarwal, et al. Vector and parallel algorithms for Cholesky factorization on IBM 3090, 1989, Proceedings of the 1989 ACM/IEEE Conference on Supercomputing (Supercomputing '89).

[17] Taher H. Haveliwala, et al. The Second Eigenvalue of the Google Matrix, 2003.

[18] Erik Elmroth, et al. High-Performance Library Software for QR Factorization, 2000, PARA.

[19] Jesús Labarta, et al. CellSs: Making it easier to program the Cell Broadband Engine processor, 2007, IBM J. Res. Dev.

[20] Philipp Birken, et al. Numerical Linear Algebra, 2011, Encyclopedia of Parallel Computing.

[21] Erik Elmroth, et al. Applying recursion to serial and parallel QR factorization leads to better performance, 2000, IBM J. Res. Dev.

[22] Gene H. Golub, et al. Matrix Computations (3rd ed.), 1996.

[23] Robert A. van de Geijn, et al. Scheduling of QR Factorization Algorithms on SMP and Multi-Core Architectures, 2008, 16th Euromicro Conference on Parallel, Distributed and Network-Based Processing (PDP 2008).

[24] Rosa M. Badia, et al. CellSs: a Programming Model for the Cell BE Architecture, 2006, ACM/IEEE SC 2006 Conference (SC'06).

[25] Fred G. Gustavson, et al. New Generalized Matrix Data Structures Lead to a Variety of High-Performance Algorithms, 2000, The Architecture of Scientific Software.

[26] Robert A. van de Geijn, et al. Parallel out-of-core computation and updating of the QR factorization, 2005, TOMS.

[27] E. L. Yip, et al. FORTRAN subroutines for out-of-core solutions of large complex linear systems, 1979.

[28] Jesse L. Barlow, et al. Block and Parallel Versions of One-Sided Bidiagonalization, 2007, SIAM J. Matrix Anal. Appl.

[29] Viktor K. Prasanna, et al. Analysis of memory hierarchy performance of block data layout, 2002, Proceedings International Conference on Parallel Processing.

[30] Julien Langou, et al. A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures, 2007, Parallel Comput.

[31] Jack Dongarra, et al. Multithreading for synchronization tolerance in matrix factorization, 2007.

[32] Z. Drmač, et al. A new stable bidiagonal reduction algorithm, 2005.

[33] Erik Elmroth, et al. New Serial and Parallel Recursive QR Factorization Algorithms for SMP Systems, 1998, PARA.

[34] S. P. Kumar, et al. Solving Linear Algebraic Equations on an MIMD Computer, 1983, JACM.

[35] Jack Dongarra, et al. Parallel Tiled QR Factorization for Multicore Architectures, LAPACK Working Note #190, 2007.