Two-Stage Tridiagonal Reduction for Dense Symmetric Matrices Using Tile Algorithms on Multicore Architectures

While successful implementations already exist for one-sided transformations (e.g., QR, LU and Cholesky factorizations) on multicore architectures, achieving high performance for two-sided reductions (e.g., Hessenberg, tridiagonal and bidiagonal reductions) remains an open and difficult research problem because of the expensive memory-bound operations that occur during the panel factorization. The processor-memory speed gap continues to widen, which further exacerbates the problem. This paper focuses on an efficient implementation of the tridiagonal reduction, which is the first algorithmic step toward computing the spectral decomposition of a dense symmetric matrix. The original matrix is translated into a \emph{tile} layout, i.e., a high-performance data representation that substantially enhances data locality. Following a two-stage approach, the tile matrix is first transformed into band tridiagonal form using compute-intensive kernels. The band form is then further reduced to the required tridiagonal form using a \emph{left-looking} bulge chasing technique that reduces memory traffic and memory contention. A dependence translation layer associated with a dynamic runtime system allows tasks generated from both stages to be scheduled and overlapped. The resulting tile tridiagonal reduction significantly outperforms state-of-the-art numerical libraries (10X against multithreaded LAPACK with optimized MKL BLAS and 2.5X against the commercial numerical software Intel MKL) for medium to large matrix sizes.
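As a minimal sketch of what a tile layout means in practice, the NumPy snippet below repacks a square matrix so that each nb-by-nb tile occupies one contiguous block of memory. The function names `to_tile_layout` and `from_tile_layout`, the tile size `nb`, and the restriction to matrix orders that are a multiple of `nb` are assumptions made for illustration; this is not the paper's implementation.

```python
import numpy as np


def to_tile_layout(a, nb):
    """Repack a square matrix into a (p, p, nb, nb) array of contiguous tiles.

    Hypothetical helper for illustration: tiles[i, j] holds the block
    a[i*nb:(i+1)*nb, j*nb:(j+1)*nb], stored contiguously in memory.
    """
    n = a.shape[0]
    if a.shape != (n, n) or n % nb != 0:
        raise ValueError("expected a square matrix whose order is a multiple of nb")
    p = n // nb
    # Split rows and columns into (tile index, index within tile), then bring
    # the two tile indices to the front and copy into contiguous storage.
    return np.ascontiguousarray(a.reshape(p, nb, p, nb).swapaxes(1, 2))


def from_tile_layout(tiles):
    """Inverse of to_tile_layout: rebuild the ordinary 2-D matrix."""
    p, _, nb, _ = tiles.shape
    return tiles.swapaxes(1, 2).reshape(p * nb, p * nb)


a = np.arange(36.0).reshape(6, 6)
tiles = to_tile_layout(a, nb=3)
# tiles[1, 0] is the lower-left 3x3 block of a, now contiguous in memory.
lower_left = tiles[1, 0]
```

A kernel that works on one tile then touches a single contiguous block instead of nb strided rows of the original matrix, which is the data-locality benefit the abstract refers to.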
