Achieving numerical accuracy and high performance using recursive tile LU factorization with partial pivoting

The LU factorization is an important numerical algorithm for solving systems of linear equations in science and engineering, and it is representative of many dense linear algebra computations. For example, it is the de facto algorithm implemented in the LINPACK benchmark used to rank the world's most powerful supercomputers on the TOP500 list. Multicore processors continue to challenge the development of fast and robust numerical software because of increasing levels of hardware parallelism and the widening gap between core and memory speeds. In this context, the difficulty in developing new algorithms for the scientific community lies in combining two goals: achieving high performance while maintaining the accuracy of the numerical algorithm. This paper proposes a new approach for computing the LU factorization in parallel on multicore architectures, one that not only improves overall performance but also sustains the numerical quality of the standard LU factorization algorithm with partial pivoting. While the update of the trailing submatrix is computationally intensive and highly parallel, the inherently problematic portion of the LU factorization is the panel factorization, owing to its memory-bound character and the atomicity of selecting the appropriate pivots. Our approach uses a parallel, fine-grained, recursive formulation of the panel factorization step and implements the update of the trailing submatrix with the tile algorithm. Based on conflict-free partitioning of the data and lockless synchronization mechanisms, our implementation lets the overall computation flow naturally without contention. The dynamic runtime system QUARK is then able to schedule tasks with heterogeneous granularities and to transparently introduce algorithmic lookahead. The performance of our implementation is competitive with currently available software packages and libraries: for example, it is up to 40% faster than the equivalent Intel MKL routine and up to three times faster than LAPACK with multithreaded Intel MKL BLAS. Copyright © 2013 John Wiley & Sons, Ltd.
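To make the recursive panel formulation concrete, the following is a minimal serial NumPy sketch of recursive LU factorization with partial pivoting and LAPACK-style pivot bookkeeping. It is an illustration of the general technique only, not the paper's parallel, tile-based, QUARK-scheduled implementation; the helper names (`getrf_rec`, `apply_pivots`, `solve_unit_lower`) are invented for this sketch.

```python
import numpy as np

def apply_pivots(B, ipiv):
    """Apply row interchanges i <-> ipiv[i], in order, to the rows of B."""
    for i, p in enumerate(ipiv):
        if p != i:
            B[[i, p], :] = B[[p, i], :]

def solve_unit_lower(L, B):
    """In-place forward substitution B <- inv(L) @ B, L unit lower triangular."""
    for i in range(1, L.shape[0]):
        B[i, :] -= L[i, :i] @ B[:i, :]

def getrf_rec(A, ipiv):
    """Recursive LU with partial pivoting, in place.

    On return, the strictly lower part of A holds L (unit diagonal implied),
    the upper part holds U, and ipiv records LAPACK-style row swaps
    (row i was exchanged with row ipiv[i], applied in order i = 0, 1, ...).
    """
    m, n = A.shape
    if n == 1:                               # base case: a single column
        p = int(np.argmax(np.abs(A[:, 0])))  # select the pivot row
        ipiv[0] = p
        if p != 0:
            A[[0, p], :] = A[[p, 0], :]
        if A[0, 0] != 0.0:
            A[1:, 0] /= A[0, 0]              # scale to form the L column
        return
    n1 = n // 2
    getrf_rec(A[:, :n1], ipiv[:n1])              # factor the left half
    apply_pivots(A[:, n1:], ipiv[:n1])           # propagate its pivots right
    solve_unit_lower(A[:n1, :n1], A[:n1, n1:])   # U12 = inv(L11) @ A12
    A[n1:, n1:] -= A[n1:, :n1] @ A[:n1, n1:]     # Schur complement update
    getrf_rec(A[n1:, n1:], ipiv[n1:])            # factor the trailing block
    ipiv[n1:] += n1                              # lift pivots to global rows
    apply_pivots(A[n1:, :n1], ipiv[n1:] - n1)    # pivot the left half's L
```

Reconstructing L and U from the overwritten matrix and applying the recorded swaps, in order, to the original matrix should satisfy P A = L U, the same invariant maintained by LAPACK's `dgetrf`. The recursion halves the panel width at each level, so most of the work lands in the matrix-matrix products of the Schur update, which is what makes the recursive formulation cache-efficient.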
