Auto-blocking matrix-multiplication or tracking BLAS3 performance from source code

An elementary, machine-independent, recursive algorithm for matrix multiplication C += A*B provides implicit blocking at every level of the memory hierarchy and tests out faster than classically optimized code, tracking hand-coded BLAS3 routines. Proof of concept is demonstrated by racing the in-place algorithm against the manufacturer's hand-tuned BLAS3 routines; it can win. The recursive code bifurcates naturally at the top level into independent block-oriented processes, each of which writes to a disjoint and contiguous region of memory. Experience has shown that the indexing vastly improves the patterns of memory access at all levels of the memory hierarchy, independently of the sizes of caches or pages and without ad hoc programming. It also exposes a weakness in SGI's C compilers, which merrily unroll loops for the superscalar R8000 processor but do not analogously unfold the base cases of the most elementary recursions. Such deficiencies might deter future programmers from using this rich class of recursive algorithms.
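A minimal sketch of the recursive scheme in C, under assumptions not taken from the paper: square matrices of power-of-two order, plain row-major storage with leading dimension ld (the paper's own setting is quadtree/Morton order, which additionally makes every quadrant contiguous in memory), and an illustrative cutoff BASE. The names mm_rec, BASE, and ld are hypothetical.

```c
#include <stddef.h>

#define BASE 16  /* illustrative cutoff order for the recursion's base case */

/* C += A*B on n-by-n blocks embedded in row-major matrices with leading
   dimension ld. Splitting every operand into quadrants yields eight
   half-size products, giving implicit blocking at every level of the
   memory hierarchy without any cache- or page-size parameters. */
static void mm_rec(size_t n, size_t ld,
                   const double *A, const double *B, double *C)
{
    if (n <= BASE) {                       /* base case: classical triple loop */
        for (size_t i = 0; i < n; i++)
            for (size_t k = 0; k < n; k++)
                for (size_t j = 0; j < n; j++)
                    C[i*ld + j] += A[i*ld + k] * B[k*ld + j];
        return;
    }
    size_t h = n / 2;                      /* quadrant order */
    const double *A11 = A,        *A12 = A + h,
                 *A21 = A + h*ld, *A22 = A + h*ld + h;
    const double *B11 = B,        *B12 = B + h,
                 *B21 = B + h*ld, *B22 = B + h*ld + h;
    double       *C11 = C,        *C12 = C + h,
                 *C21 = C + h*ld, *C22 = C + h*ld + h;

    /* The four C quadrants are pairwise disjoint, so each pair of calls
       below writes its own region of memory and the top level bifurcates
       into independent block-oriented processes. */
    mm_rec(h, ld, A11, B11, C11);  mm_rec(h, ld, A12, B21, C11);
    mm_rec(h, ld, A11, B12, C12);  mm_rec(h, ld, A12, B22, C12);
    mm_rec(h, ld, A21, B11, C21);  mm_rec(h, ld, A22, B21, C21);
    mm_rec(h, ld, A21, B12, C22);  mm_rec(h, ld, A22, B22, C22);
}
```

The four pairs of calls accumulating into C11, C12, C21, and C22 touch disjoint output regions, so they can be dispatched as the four independent processes the abstract describes; and BASE marks precisely the base case whose unfolding, per the abstract, SGI's C compilers would not perform even though they unroll the equivalent loops.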
