Recursion based parallelization of exact dense linear algebra routines for Gaussian elimination

Highlights:
- Use of fast matrix arithmetic and modular reductions penalizes fine grain parallelism.
- Block recursive algorithms with OpenMP tasks can reach state-of-the-art efficiency.
- The libkomp library handles recursive tasks more efficiently than libgomp.
- Recursive PLUQ decomposition behaves best using explicit task synchronizations.
- Dataflow task synchronizations improve efficiency on finer grain implementations.

We present block algorithms and their implementation for the parallelization of sub-cubic Gaussian elimination on shared memory architectures. Contrary to the classical cubic algorithms in parallel numerical linear algebra, we focus here on recursive algorithms and coarse grain parallelization. Indeed, sub-cubic matrix arithmetic can only be achieved through recursive algorithms, making coarse grain block algorithms perform more efficiently than fine grain ones. This work is motivated by the design and implementation of dense linear algebra over a finite field, where fast matrix multiplication is used extensively and where costly modular reductions also advocate for coarse grain block decomposition. We incrementally build efficient kernels, first for matrix multiplication, then for triangular system solving, on top of which a recursive PLUQ decomposition algorithm is built. We study the parallelization of these kernels using several algorithmic variants: either iterative or recursive, and using different splitting strategies. Experiments show that recursive adaptive methods for matrix multiplication, hybrid recursive-iterative methods for triangular system solving, and tile recursive versions of the PLUQ decomposition, together with various data mapping policies, provide the best performance on a 32-core NUMA architecture. Overall, we show that the overhead of modular reductions is more than compensated by the fast linear algebra algorithms, and that exact dense linear algebra matches the performance of full-rank reference numerical software even in the presence of rank deficiencies.
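The coarse grain approach described above can be pictured with a short sketch. The following is a minimal, illustrative C++/OpenMP version of task-based recursive matrix multiplication over Z/pZ; it is not the FFLAS-FFPACK interface. Names such as `pfgemm` and `mul_rec` are hypothetical, the split is a plain 2x2 block recursion rather than a Strassen-Winograd variant, and it assumes square power-of-two dimensions, a zero-initialized C, and a prime p small enough that (cutoff+1)*p^2 fits in a signed 64-bit word. What it illustrates is the point made in the highlights: modular reductions are hoisted out of the inner loops and applied once per coarse base-case block, and synchronization is an explicit taskwait at each recursion level.

```cpp
#include <cstdint>

// Reduce an n x n block of C modulo p. Doing this once per base-case
// block, rather than once per scalar operation, is the delayed modular
// reduction that favors coarse grain.
static void reduce_block(int64_t* C, int n, int ldc, int64_t p) {
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            C[i * ldc + j] %= p;  // entries stay nonnegative, so %= suffices
}

// Base case: classical C += A*B, accumulating unreduced 64-bit products,
// then a single reduction of the whole block.
static void mul_base(const int64_t* A, const int64_t* B, int64_t* C,
                     int n, int ld, int64_t p) {
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < n; ++k)
            for (int j = 0; j < n; ++j)
                C[i * ld + j] += A[i * ld + k] * B[k * ld + j];
    reduce_block(C, n, ld, p);
}

// Recursive 2x2 block splitting. The four quadrants of C are independent,
// so each pair of accumulating products becomes one OpenMP task; the
// taskwait is the explicit task synchronization at this recursion level.
static void mul_rec(const int64_t* A, const int64_t* B, int64_t* C,
                    int n, int ld, int64_t p, int cutoff) {
    if (n <= cutoff) { mul_base(A, B, C, n, ld, p); return; }
    const int h = n / 2;
    const int64_t *A11 = A,          *A12 = A + h,
                  *A21 = A + h * ld, *A22 = A + h * ld + h;
    const int64_t *B11 = B,          *B12 = B + h,
                  *B21 = B + h * ld, *B22 = B + h * ld + h;
    int64_t       *C11 = C,          *C12 = C + h,
                  *C21 = C + h * ld, *C22 = C + h * ld + h;
    #pragma omp task
    { mul_rec(A11, B11, C11, h, ld, p, cutoff);
      mul_rec(A12, B21, C11, h, ld, p, cutoff); }
    #pragma omp task
    { mul_rec(A11, B12, C12, h, ld, p, cutoff);
      mul_rec(A12, B22, C12, h, ld, p, cutoff); }
    #pragma omp task
    { mul_rec(A21, B11, C21, h, ld, p, cutoff);
      mul_rec(A22, B21, C21, h, ld, p, cutoff); }
    #pragma omp task
    { mul_rec(A21, B12, C22, h, ld, p, cutoff);
      mul_rec(A22, B22, C22, h, ld, p, cutoff); }
    #pragma omp taskwait
}

// Entry point: one parallel region, a single thread seeds the task tree.
void pfgemm(const int64_t* A, const int64_t* B, int64_t* C,
            int n, int64_t p, int cutoff = 128) {
    #pragma omp parallel
    #pragma omp single
    mul_rec(A, B, C, n, n, p, cutoff);
}
```

Whether such a recursive task tree runs efficiently then depends on the runtime scheduler: the highlights report that libkomp handles these nested tasks with lower overhead than libgomp.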
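The hybrid recursive-iterative triangular solve mentioned in the abstract combines two orthogonal ideas: recursion along the triangular dimension, whose two half-solves are sequentially dependent and thus expose no parallelism by themselves, and an iterative parallel loop over independent column panels of the right-hand side, which supplies the parallelism. The sketch below is one way to realize that combination, again with hypothetical names (`ptrsm`, `trsm_rec`) rather than the library's actual kernel, and under the assumptions that L is unit lower-triangular (as produced by LU-type eliminations, so no modular inverses are needed) and that n*p^2 fits in a signed 64-bit word.

```cpp
#include <cstdint>

// Normalize a possibly negative value into [0, p).
static int64_t normp(int64_t x, int64_t p) { return ((x % p) + p) % p; }

// Base case: in-place forward substitution of a unit lower-triangular
// n x n block of L against an n x w panel of B, reducing once per entry.
static void trsm_base(const int64_t* L, int ldl, int64_t* B, int ldb,
                      int n, int w, int64_t p) {
    for (int j = 0; j < w; ++j)
        for (int i = 0; i < n; ++i) {
            int64_t acc = B[i * ldb + j];
            for (int k = 0; k < i; ++k)
                acc -= L[i * ldl + k] * B[k * ldb + j];
            B[i * ldb + j] = normp(acc, p);
        }
}

// Update B2 -= L21 * X1 (mod p): the Schur-complement step of the recursion.
static void gemm_sub(const int64_t* L21, int ldl, const int64_t* X1, int ldb,
                     int64_t* B2, int rows, int inner, int w, int64_t p) {
    for (int i = 0; i < rows; ++i)
        for (int j = 0; j < w; ++j) {
            int64_t acc = B2[i * ldb + j];
            for (int k = 0; k < inner; ++k)
                acc -= L21[i * ldl + k] * X1[k * ldb + j];
            B2[i * ldb + j] = normp(acc, p);
        }
}

// Recursive part: split L into [[L11, 0], [L21, L22]]. X1 must be known
// before B2 can be updated and solved, hence the sequential dependence.
static void trsm_rec(const int64_t* L, int ldl, int64_t* B, int ldb,
                     int n, int w, int64_t p, int cutoff) {
    if (n <= cutoff) { trsm_base(L, ldl, B, ldb, n, w, p); return; }
    const int h = n / 2;
    trsm_rec(L, ldl, B, ldb, h, w, p, cutoff);             // X1 = L11^{-1} B1
    gemm_sub(L + h * ldl, ldl, B, ldb, B + h * ldb,
             n - h, h, w, p);                              // B2 -= L21 * X1
    trsm_rec(L + h * ldl + h, ldl, B + h * ldb, ldb,
             n - h, w, p, cutoff);                         // X2 = L22^{-1} B2
}

// Iterative part: the column panels of the n x m right-hand side B are
// independent, so a parallel loop over panels provides the parallelism
// that the recursion alone lacks.
void ptrsm(const int64_t* L, int64_t* B, int n, int m, int64_t p,
           int panel = 256, int cutoff = 64) {
    #pragma omp parallel for schedule(dynamic)
    for (int j = 0; j < m; j += panel) {
        const int w = (m - j < panel) ? m - j : panel;
        trsm_rec(L, n, B + j, m, n, w, p, cutoff);
    }
}
```

The same tension drives the PLUQ results quoted in the highlights: the recursive decomposition carries strict sequential dependences between its sub-steps, so how those dependences are expressed, by explicit task synchronizations at coarse grain or by dataflow dependencies at finer grain, determines the achievable efficiency.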
