Automatic blocking of QR and LU factorizations for locality

QR and LU factorizations of dense matrices are important linear algebra computations that are widely used in scientific applications. To run efficiently on modern computers, the factorization algorithms must be blocked when operating on large matrices so that they exploit the deep cache hierarchies of today's memory systems. Because both QR (based on Householder transformations) and LU factorization algorithms contain complex loop structures, few compilers can fully automate their blocking. Although linear algebra libraries such as LAPACK provide manually blocked implementations of these algorithms, generating the blocked versions automatically offers additional benefits, such as automatic adaptation among different blocking strategies. This paper demonstrates how to apply an aggressive loop transformation technique, dependence hoisting, to produce efficient blockings for both QR and LU with partial pivoting. We present the different blocking strategies that our optimizer can generate and compare the performance of the auto-blocked versions with the manually tuned versions in LAPACK, each using reference BLAS, ATLAS BLAS, and native BLAS specially tuned for the underlying machine architecture.
