CALU: A Communication Optimal LU Factorization Algorithm

Since the cost of communication (moving data) greatly exceeds the cost of arithmetic on current and future computing platforms, we are motivated to devise algorithms that communicate as little as possible, even at the price of slightly more arithmetic, provided they still get the right answer. This paper is about getting the right answer for such an algorithm. It discusses CALU, a communication-avoiding LU factorization algorithm based on a new pivoting strategy that we refer to as tournament pivoting. The reason to consider CALU is that it performs an optimal amount of communication, asymptotically less than Gaussian elimination with partial pivoting (GEPP), and so will be much faster on platforms where communication is expensive, as shown in previous work. We show that the Schur complement obtained after each step of performing CALU on a matrix $A$ is the same as the Schur complement obtained after performing GEPP on a larger matrix whose entries are the entries of $A$ (sometimes slightly perturbed) and zeros. More generally, the entire CALU process is equivalent to GEPP on a larger, but very sparse, matrix formed by entries of $A$ and zeros. Hence we expect CALU to behave like GEPP and to be very stable in practice, and extensive experiments on random matrices and a set of special matrices confirm that it is. The upper bound on the growth factor of CALU is worse than that of GEPP. However, there are Wilkinson-like matrices for which GEPP has exponential growth factor but CALU does not, and vice versa.
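To make the idea of tournament pivoting concrete, the following is a minimal sketch (not the paper's implementation; all function names here are hypothetical, and numpy is assumed) of how pivot rows for a tall-skinny panel can be selected with a binary reduction tree: each leaf runs GEPP on a block of rows to nominate candidate pivot rows, and pairs of candidate sets are repeatedly merged with another GEPP pass until one set of pivot rows remains.

```python
import numpy as np

def gepp_pivot_rows(block):
    """Return indices (into `block`) of the pivot rows chosen by
    Gaussian elimination with partial pivoting on an m-by-b block."""
    A = block.astype(float).copy()
    m, b = A.shape
    perm = np.arange(m)
    for k in range(min(m, b)):
        p = k + np.argmax(np.abs(A[k:, k]))      # partial pivoting: largest entry in column k
        A[[k, p]] = A[[p, k]]
        perm[[k, p]] = perm[[p, k]]
        if A[k, k] != 0.0:
            A[k+1:, k] /= A[k, k]
            A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])
    return perm[:min(m, b)]

def tournament_pivot_rows(panel, b):
    """Select b pivot rows for an m-by-b panel via a binary reduction tree:
    leaves run GEPP on contiguous row blocks; winners are merged pairwise."""
    m = panel.shape[0]
    # Leaves: each contiguous block of rows nominates its GEPP pivot rows.
    blocks = [np.arange(i, min(i + 2 * b, m)) for i in range(0, m, 2 * b)]
    candidates = [rows[gepp_pivot_rows(panel[rows])] for rows in blocks]
    # Reduction tree: stack two candidate sets, re-run GEPP, keep the winners.
    while len(candidates) > 1:
        merged = []
        for i in range(0, len(candidates), 2):
            rows = np.concatenate(candidates[i:i + 2])
            merged.append(rows[gepp_pivot_rows(panel[rows])])
        candidates = merged
    return candidates[0]
```

The point of this structure is that each GEPP call touches only a small block that fits in fast memory (or on one processor), so the number of messages exchanged grows with the depth of the tree rather than with the panel height, which is the source of CALU's communication savings.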
