The generalized matrix chain algorithm

In this paper, we present a generalized version of the matrix chain algorithm to generate efficient code for linear algebra problems, a task for which human experts often invest days or even weeks of works. The standard matrix chain problem consists in finding the parenthesization of a matrix product M := A1 A2 ⋯ An that minimizes the number of scalar operations. In practical applications, however, one frequently encounters more complicated expressions, involving transposition, inversion, and matrix properties. Indeed, the computation of such expressions relies on a set of computational kernels that offer functionality well beyond the simple matrix product. The challenge then shifts from finding an optimal parenthesization to finding an optimal mapping of the input expression to the available kernels. Furthermore, it is often the case that a solution based on the minimization of scalar operations does not result in the optimal solution in terms of execution time. In our experiments, the generated code outperforms other libraries and languages on average by a factor of about 9. The motivation for this work comes from the fact that—despite great advances in the development of compilers—the task of mapping linear algebra problems to optimized kernels is still to be done manually. In order to relieve the user from this complex task, new techniques for the compilation of linear algebra expressions have to be developed.

[1]  Roger L. Wainwright,et al.  Parallelization of the dynamic programming algorithm for the matrix chain product on a hypercube , 1990, Proceedings of the 1990 Symposium on Applied Computing.

[2]  Paolo Bientinesi,et al.  Knowledge-Based Automatic Generation of Partitioned Matrix Expressions , 2011, CASC.

[3]  Todd L. Veldhuizen,et al.  Arrays in Blitz++ , 1998, ISCOPE.

[4]  Man-Tak Shing,et al.  Computation of Matrix Chain Products. Part II , 1984, SIAM J. Comput..

[5]  Ulrich Rüde,et al.  Expression Templates Revisited: A Performance Analysis of Current Methodologies , 2011, SIAM J. Sci. Comput..

[6]  Lloyd N. Trefethen,et al.  How Fast are Nonsymmetric Matrix Iterations? , 1992, SIAM J. Matrix Anal. Appl..

[7]  Paolo Bientinesi,et al.  A Study on the Influence of Caching: Sequences of Dense Linear Algebra Kernels , 2014, VECPAR.

[8]  Paolo Bientinesi,et al.  The ELAPS framework: Experimental Linear Algebra Performance Studies , 2015, Int. J. High Perform. Comput. Appl..

[9]  Phillip G. Bradford,et al.  Efficient Matrix Chain Ordering in Polylog Time , 1998, SIAM J. Comput..

[10]  Manuel Krebber,et al.  Non-linear Associative-Commutative Many-to-One Pattern Matching with Sequence Variables , 2017, ArXiv.

[11]  Alexander M. Bronstein,et al.  Consistent Discretization and Minimization of the L1 Norm on Manifolds , 2016, 2016 Fourth International Conference on 3D Vision (3DV).

[12]  Koji Nakano,et al.  Accelerating the Dynamic Programming for the Matrix Chain Product on the GPU , 2011, 2011 Second International Conference on Networking and Computing.

[13]  Lothar Reichel,et al.  Some matrix nearness problems suggested by Tikhonov regularization , 2016, 1602.03307.

[14]  Behrooz Makki,et al.  Robust Successive Compute-and-Forward Over Multiuser Multirelay Networks , 2016, IEEE Transactions on Vehicular Technology.

[15]  Paolo Bientinesi,et al.  Application-tailored linear algebra algorithms , 2013, Int. J. High Perform. Comput. Appl..

[16]  Daniele G. Spampinato,et al.  A basic linear algebra compiler for structured matrices , 2016, 2016 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).

[17]  Nisheeth K. Vishnoi,et al.  On a Natural Dynamics for Linear Programming , 2015, ITCS.

[18]  Ivan W. Selesnick,et al.  Sparsity-based correction of exponential artifacts , 2015, Signal Process..

[19]  Francis Y. L. Chin,et al.  An O(n) algorithm for determining a near-optimal computation order of matrix chain products , 1978, CACM.

[20]  Paolo Bientinesi,et al.  Performance Modeling for Dense Linear Algebra , 2012, 2012 SC Companion: High Performance Computing, Networking Storage and Analysis.

[21]  Charles L. Lawson,et al.  Basic Linear Algebra Subprograms for Fortran Usage , 1979, TOMS.

[22]  Fathi M. Salem,et al.  A Blind Adaptive CDMA Receiver Based on State Space Structures , 2014, ArXiv.

[23]  Nadia Nedjah,et al.  Optimal Left-to-Right Pattern-Matching Automata , 1997, ALP/HOA.

[24]  Jack J. Dongarra,et al.  A set of level 3 basic linear algebra subprograms , 1990, TOMS.

[25]  Paolo Bientinesi,et al.  Application-tailored linear algebra algorithms , 2012, Int. J. High Perform. Comput. Appl..

[26]  Xinwei Deng,et al.  A Parallel Implementation of the Ensemble Kalman Filter Based on Modified Cholesky Decomposition , 2019, J. Comput. Sci..

[27]  Albert Gräf,et al.  Left-to-Right Tree Pattern Matching , 1991, RTA.

[28]  Jim Christian,et al.  Flatterms, discrimination nets, and fast term rewriting , 1993, Journal of Automated Reasoning.

[29]  Artur Czumaj Very Fast Approximation of the Matrix Chain Product Problem , 1996, J. Algorithms.

[30]  Alan Edelman,et al.  Julia: A Fast Dynamic Language for Technical Computing , 2012, ArXiv.

[31]  Jack Dongarra,et al.  High Performance Computing for Computational Science , 2003 .

[32]  Jack Dongarra,et al.  Corrigenda: “An Extended Set of FORTRAN Basic Linear Algebra Subprograms” , 1988, TOMS.

[33]  Robert A. van de Geijn,et al.  Deriving Correct High-Performance Algorithms , 2017, ArXiv.

[34]  Adrian Sandu,et al.  CSL-TR-15-21 November 6 , 2015 , 2015 .

[35]  Jaeyoung Choi,et al.  The design of a parallel dense linear algebra software library: Reduction to Hessenberg, tridiagonal, and bidiagonal form , 1995, Numerical Algorithms.

[36]  Markus Püschel,et al.  A Basic Linear Algebra Compiler , 2014, CGO '14.

[37]  Robert A. van de Geijn,et al.  Families of algorithms related to the inversion of a Symmetric Positive Definite matrix , 2008, TOMS.

[38]  Paolo Bientinesi,et al.  Automatic Generation of Loop-Invariants for Matrix Operations , 2011, 2011 International Conference on Computational Science and Its Applications.

[39]  Xin-She Yang,et al.  Introduction to Algorithms , 2021, Nature-Inspired Optimization Algorithms.

[40]  Prakesh Ramanan An Efficient Parallel Algorithm for the Martix-Chain-Product Problem , 1996, SIAM J. Comput..

[41]  Henrik Barthels A compiler for linear algebra operations , 2016, SPLASH.

[42]  Jack J. Dongarra,et al.  An extended set of FORTRAN basic linear algebra subprograms , 1988, TOMS.

[43]  Marc Timme,et al.  A Dual Method for Computing Power Transfer Distribution Factors , 2015, IEEE Transactions on Power Systems.

[44]  Elizabeth R. Jessup,et al.  Build to order linear algebra kernels , 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.

[45]  Paolo Bientinesi,et al.  A Domain-Specific Compiler for Linear Algebra Operations , 2012, VECPAR.

[46]  Conrad Sanderson,et al.  Armadillo: An Open Source C++ Linear Algebra Library for Fast Prototyping and Computationally Intensive Experiments , 2010 .

[47]  Heejo Lee,et al.  Processor Allocation and Task Scheduling of Matrix Chain Products on Parallel Systems , 2003, IEEE Trans. Parallel Distributed Syst..