On Algorithmic Variants of Parallel Gaussian Elimination: Comparison of Implementations in Terms of Performance and Numerical Properties

Gaussian elimination is a canonical linear algebra procedure for solving systems of linear equations. In the last few years, the algorithm has received considerable attention in efforts to improve its parallel performance. This article surveys recent developments in parallel implementations of Gaussian elimination. Five variants are investigated. Three of them are based on different pivoting strategies: partial pivoting, incremental pivoting, and tournament pivoting. The fourth replaces pivoting with the Random Butterfly Transformation, and the fifth, an implementation without pivoting, serves as a performance baseline. Iterative refinement is applied to recover numerical accuracy when necessary. All parallel implementations are produced using dynamic, superscalar runtime scheduling and a tile matrix layout. Results on two multi-socket multicore systems are presented. Performance and numerical accuracy are analyzed.
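
To make the roles of the no-pivoting baseline and iterative refinement concrete, the following is a minimal sequential sketch in pure Python. It is not the tiled, runtime-scheduled implementation the article evaluates; the function names (`lu_nopiv`, `lu_solve`, `residual`, `solve_refined`) and the parameters `max_iter` and `tol` are hypothetical, chosen for illustration only. The sketch factors the matrix once without pivoting, then reuses the factors to correct the solution from its residual.

```python
def lu_nopiv(a):
    """LU factorization without pivoting (illustrative baseline).

    Returns a new matrix holding L below the diagonal (unit diagonal
    implied) and U on and above the diagonal.
    """
    n = len(a)
    lu = [row[:] for row in a]
    for k in range(n):
        pivot = lu[k][k]
        if pivot == 0.0:
            # Without pivoting, a zero diagonal entry cannot be avoided.
            raise ZeroDivisionError("zero pivot encountered")
        for i in range(k + 1, n):
            lu[i][k] /= pivot
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
    return lu


def lu_solve(lu, b):
    """Solve L U x = b by forward and backward substitution."""
    n = len(lu)
    y = b[:]
    for i in range(n):
        for j in range(i):
            y[i] -= lu[i][j] * y[j]
    x = y[:]
    for i in reversed(range(n)):
        for j in range(i + 1, n):
            x[i] -= lu[i][j] * x[j]
        x[i] /= lu[i][i]
    return x


def residual(a, x, b):
    """Compute r = b - A x."""
    return [bi - sum(aij * xj for aij, xj in zip(row, x))
            for row, bi in zip(a, b)]


def solve_refined(a, b, max_iter=5, tol=1e-14):
    """Solve A x = b, then apply iterative refinement.

    Each step solves A d = r with the existing factors and updates
    x <- x + d. The stopping rule (relative max-norm residual) is an
    assumption of this sketch.
    """
    lu = lu_nopiv(a)
    x = lu_solve(lu, b)
    for _ in range(max_iter):
        r = residual(a, x, b)
        if max(abs(ri) for ri in r) <= tol * max(abs(bi) for bi in b):
            break
        d = lu_solve(lu, r)
        x = [xi + di for xi, di in zip(x, d)]
    return x


if __name__ == "__main__":
    a = [[4.0, 3.0], [6.0, 3.0]]
    b = [10.0, 12.0]
    print(solve_refined(a, b))
```

Because the factorization is computed once and each refinement step costs only a residual and two triangular solves, refinement is cheap relative to the factorization itself. It cannot rescue every case: when the unpivoted factors are badly inaccurate, the correction steps may fail to converge.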
