Level-3 Cholesky Factorization Routines Improve Performance of Many Cholesky Algorithms

Four routines, DPOTF3i with i = a, b, c, d, are presented. The DPOTF3i routines are a novel type of Level-3 BLAS kernel for use by Blocked Packed Format (BPF) Cholesky factorization and by the LAPACK routine DPOTRF. The performance of the DPOTF3i routines is still increasing at block sizes where the performance of the Level-2 LAPACK routine DPOTF2 has begun to decrease. This is our main result; because it allows a larger block size nb, the Level-3 routines DGEMM, DSYRK, and DTRSM also run faster. The four DPOTF3i routines use simple register blocking; since different platforms provide different numbers of registers, the four routines use different register-blocking sizes. BPF is introduced. LAPACK routines for POTRF and PPTRF that use BPF instead of full and packed format are shown to be trivial modifications of the LAPACK POTRF source code; we call these codes BPTRF. There are two variants of BPF, lower and upper. Upper BPF is essentially identical to Square Block Packed Format (SBPF), which LAPACK-style implementations on multicore processors use. Lower BPF is less efficient than upper BPF, but vector in-place transposition converts lower BPF to upper BPF very efficiently. Corroborating performance results for DPOTF3i versus DPOTF2 on a variety of common platforms are given for n ≈ nb, as well as results for large n comparing DBPTRF with DPOTRF.
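To make the main claim concrete, the following is a minimal sketch (not the authors' code) of the standard right-looking blocked Cholesky algorithm that DPOTRF implements, written as self-contained C with plain loops standing in for the BLAS calls. The helper names chol_diag and chol_blocked are hypothetical; in LAPACK the three phases are performed by DPOTF2 (the role the DPOTF3i kernels would take over), DTRSM, and DSYRK. A faster diagonal-block kernel permits a larger nb, which shifts more of the roughly n³/3 flops into the Level-3 update phases.

```c
/* Minimal sketch, assuming full column-major storage with leading
 * dimension n and the lower triangle of A held as input.  Each phase
 * is labelled with the LAPACK/BLAS routine that performs it inside
 * DPOTRF. */
#include <math.h>
#include <stddef.h>

#define A(i, j) a[(size_t)(j) * n + (i)] /* column-major indexing */

/* Unblocked Level-2 Cholesky of the kb-by-kb diagonal block starting
 * at (k,k): the role of DPOTF2, or of the register-blocked DPOTF3i
 * kernels.  Returns 0 on success, j+1 if A is not positive definite. */
static int chol_diag(double *a, int n, int k, int kb)
{
    for (int j = k; j < k + kb; ++j) {
        double d = A(j, j);
        for (int p = k; p < j; ++p)
            d -= A(j, p) * A(j, p);
        if (d <= 0.0)
            return j + 1;
        A(j, j) = sqrt(d);
        for (int i = j + 1; i < k + kb; ++i) {
            double s = A(i, j);
            for (int p = k; p < j; ++p)
                s -= A(i, p) * A(j, p);
            A(i, j) = s / A(j, j);
        }
    }
    return 0;
}

/* Right-looking blocked factorization A = L * L^T; the lower triangle
 * of A is overwritten by L. */
int chol_blocked(double *a, int n, int nb)
{
    for (int k = 0; k < n; k += nb) {
        int kb = (k + nb <= n) ? nb : n - k;

        /* 1. Factor the diagonal block (DPOTF2 / DPOTF3i). */
        int info = chol_diag(a, n, k, kb);
        if (info) return info;

        /* 2. Panel solve (DTRSM): L21 := A21 * L11^{-T}. */
        for (int j = k; j < k + kb; ++j)
            for (int i = k + kb; i < n; ++i) {
                double s = A(i, j);
                for (int p = k; p < j; ++p)
                    s -= A(i, p) * A(j, p);
                A(i, j) = s / A(j, j);
            }

        /* 3. Trailing update (DSYRK): A22 := A22 - L21 * L21^T.
         * For larger nb this Level-3 phase dominates the flop count,
         * which is why a faster diagonal-block kernel speeds up the
         * whole factorization. */
        for (int j = k + kb; j < n; ++j)
            for (int i = j; i < n; ++i) {
                double s = 0.0;
                for (int p = k; p < k + kb; ++p)
                    s += A(i, p) * A(j, p);
                A(i, j) -= s;
            }
    }
    return 0;
}
```

The same loop structure carries over to BPF/SBPF storage, where each nb × nb block is held contiguously and phases 2 and 3 become BLAS calls on square blocks; the contiguous blocks are what make the in-place lower-to-upper BPF conversion a sequence of small block transposes.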
