A Case for Malleable Thread-Level Linear Algebra Libraries: The LU Factorization With Partial Pivoting

We propose two novel techniques for overcoming the load imbalance encountered when implementing so-called look-ahead mechanisms in dense matrix factorizations for the solution of linear systems. Both techniques target the scenario where two thread teams are created/activated during the factorization, with each team in charge of performing an independent task/branch of execution. The first technique promotes worker sharing (WS) between the two tasks, allowing the threads of the task that completes first to be reallocated to the costlier task. The second technique allows a fast task to alert the slower task of its completion, enforcing the early termination (ET) of the second task and a smooth transition of the factorization procedure into the next iteration. The two mechanisms are instantiated via a new malleable thread-level implementation of the Basic Linear Algebra Subprograms (BLAS), and their benefits are illustrated via an implementation of the LU factorization with partial pivoting enhanced with look-ahead. Concretely, our experimental results on an Intel Xeon system with 12 cores show the benefits of combining WS+ET, reporting performance that is competitive with a task-parallel runtime-based solution.
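To make the two mechanisms concrete, the minimal OpenMP sketch below (our own illustration, not the paper's implementation) wires them together for one iteration with look-ahead: ET via a completion flag that the slower branch polls at block boundaries, and WS via an atomically updated worker count that a malleable BLAS would re-read between blocks. The kernels panel_factorize_block() and update_block() are hypothetical placeholders for the real panel-factorization and trailing-update routines.

#include <stdatomic.h>
#include <stdio.h>
#include <omp.h>

static atomic_int update_done    = 0;  /* ET: set by the task that finishes first */
static atomic_int update_workers = 1;  /* WS: threads currently serving the update */

/* Hypothetical stand-in for one block of the panel factorization. */
static void panel_factorize_block(int b) {
    for (volatile int i = 0; i < 2000000; ++i) ;  /* simulated work */
}

/* Hypothetical stand-in for one block of the trailing update,
 * executed with a (possibly grown) number of workers. */
static void update_block(int b, int nworkers) {
    for (volatile int i = 0; i < 1000000; ++i) ;  /* simulated work */
}

int main(void) {
    const int panel_blocks = 16, update_blocks = 16;

    #pragma omp parallel sections num_threads(2)
    {
        /* Branch 1: panel factorization of iteration k+1 (look-ahead). */
        #pragma omp section
        {
            int b;
            for (b = 0; b < panel_blocks; ++b) {
                /* ET: poll at block boundaries; if the trailing update is
                 * already done, stop early so the factorization can move
                 * smoothly into the next iteration. */
                if (atomic_load(&update_done)) break;
                panel_factorize_block(b);
            }
            /* WS: this branch is done (or terminated early); donate its
             * worker so a malleable BLAS can widen the update's team. */
            atomic_fetch_add(&update_workers, 1);
            printf("panel branch stopped after %d block(s)\n", b);
        }

        /* Branch 2: trailing update of iteration k. */
        #pragma omp section
        {
            for (int b = 0; b < update_blocks; ++b)
                update_block(b, atomic_load(&update_workers));
            atomic_store(&update_done, 1);   /* alert the panel branch (ET) */
        }
    }
    return 0;
}

Compiled with a C11 compiler and -fopenmp, the sketch reports how many panel blocks ran before ET fired; in the actual malleable library, the shared worker count would instead feed the thread configuration of the BLAS kernels between consecutive blocks.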
