Tiled Algorithms for Matrix Computations on Multicore Architectures

Computer architectures have moved toward multicore and many-core designs. However, the algorithms in current sequential dense numerical linear algebra libraries (e.g., LAPACK) do not parallelize well on these architectures. A new family of algorithms, the tile algorithms, has recently been introduced to address this problem. Previous research has shown that it is possible to write efficient and scalable tile algorithms for the Cholesky factorization, a (pseudo) LU factorization, and the QR factorization. The goal of this thesis is to study tile algorithms in a multicore and many-core setting and to provide new algorithms that exploit current architectures to improve performance relative to state-of-the-art libraries while maintaining the stability and robustness of those libraries.
