Fractal Matrix Multiplication: A Case Study on Portability of Cache Performance

The practical portability of a simple version of matrix multiplication is demonstrated. The multiplication algorithm is designed to exploit maximal and predictable locality at all levels of the memory hierarchy, with no a priori knowledge of the specific memory-system organization of any particular machine. By both simulations and execution on a number of platforms, we show that portability across memory hierarchies does not sacrifice floating-point performance; indeed, performance is always a significant fraction of peak and, on at least one machine, exceeds that of the tuned routines of both ATLAS and the vendor. The results are obtained by careful algorithm engineering, which combines a number of known as well as novel implementation ideas. This effort can be viewed as an experimental case study, complementary to the theoretical investigations on portability of cache performance begun by Bilardi and Peserico.
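The core idea behind such cache-oblivious ("fractal") matrix multiplication is to decompose the product recursively into quadrant subproducts, so that at every level of recursion some subproblem fits the corresponding level of the memory hierarchy without any machine-specific tuning. The sketch below illustrates only that recursive decomposition; the function names, plain list-of-lists layout, and base-case cutoff are illustrative assumptions, not the paper's tuned implementation (which also employs a matching recursive data layout).

```python
# A minimal sketch of recursive (cache-oblivious) matrix multiplication.
# Matrices are square lists of lists with power-of-two size, for simplicity.
# The base-case cutoff `base` is illustrative, not taken from the paper.

def matmul_recursive(A, B, C, ar, ac, br, bc, cr, cc, n, base=2):
    """Accumulate the n-by-n product of a block of A and a block of B
    (with top-left corners (ar,ac) and (br,bc)) into the block of C
    with top-left corner (cr,cc)."""
    if n <= base:
        # Base case: a plain triple loop on a block small enough to
        # (ideally) fit in the innermost level of the hierarchy.
        for i in range(n):
            for k in range(n):
                a = A[ar + i][ac + k]
                for j in range(n):
                    C[cr + i][cc + j] += a * B[br + k][bc + j]
        return
    h = n // 2
    # Quadrant recurrences: C11 += A11*B11 + A12*B21, etc.
    # Each tuple is (ci, cj, ai, aj, bi, bj): quadrant offsets into C, A, B.
    for (ci, cj, ai, aj, bi, bj) in [
        (0, 0, 0, 0, 0, 0), (0, 0, 0, h, h, 0),
        (0, h, 0, 0, 0, h), (0, h, 0, h, h, h),
        (h, 0, h, 0, 0, 0), (h, 0, h, h, h, 0),
        (h, h, h, 0, 0, h), (h, h, h, h, h, h),
    ]:
        matmul_recursive(A, B, C,
                         ar + ai, ac + aj,
                         br + bi, bc + bj,
                         cr + ci, cc + cj, h, base)

def matmul(A, B):
    """Return the product of two n-by-n matrices (n a power of two)."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    matmul_recursive(A, B, C, 0, 0, 0, 0, 0, 0, n)
    return C
```

Because the recursion halves the problem at every level, each cache level is automatically exploited by the subproblems that happen to fit it, which is what makes the locality both maximal and portable across machines.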

[1] Gaston H. Gonnet et al. The analysis of multidimensional searching in quad-trees, 1991, SODA '91.

[2] W. Jalby et al. To copy or not to copy: a compile-time technique for assessing when data copying should be used to eliminate cache conflicts, 1993, Supercomputing '93.

[3] Alok Aggarwal et al. Hierarchical memory with block transfer, 1987, 28th Annual Symposium on Foundations of Computer Science (SFCS 1987).

[4] Don Coppersmith et al. Matrix multiplication via arithmetic progressions, 1987, STOC.

[5] Steven G. Johnson et al. The Fastest Fourier Transform in the West, 1997.

[6] Monica S. Lam et al. The cache performance and optimizations of blocked algorithms, 1991, ASPLOS IV.

[7] Bo Kågström et al. Algorithm 784: GEMM-based level 3 BLAS: portability and optimization issues, 1998, TOMS.

[8] Michael Wolfe et al. More iteration space tiling, 1989, Proceedings of the 1989 ACM/IEEE Conference on Supercomputing (Supercomputing '89).

[9] Matteo Frigo et al. Cache-oblivious algorithms, 1999, 40th Annual Symposium on Foundations of Computer Science.

[10] Nicholas J. Higham et al. Inverse Problems Newsletter, 1991.

[11] F. P. Preparata et al. Processor-Time Tradeoffs under Bounded-Speed Message Propagation: Part I, Upper Bounds, 1995, Theory of Computing Systems.

[12] Mithuna Thottethodi et al. Nonlinear array layouts for hierarchical memory systems, 1999, ICS '99.

[13] Michael Rodeh et al. Matrix Multiplication: A Case Study of Algorithm Engineering, 1998, WAE.

[14] Monica S. Lam et al. A data locality optimizing algorithm, 1991, PLDI '91.

[15] Andrea Pietracaprina et al. On the Space and Access Complexity of Computation DAGs, 2000, WG.

[16] Gianfranco Bilardi et al. An approach towards an analytical characterization of locality and its portability, 2001, 2001 Innovative Architecture for Future Generation High-Performance Processors and Systems.

[17] Fred G. Gustavson et al. Recursion leads to automatic variable blocking for dense linear-algebra algorithms, 1997, IBM J. Res. Dev.

[18] Jeremy D. Frens et al. Auto-blocking matrix-multiplication or tracking BLAS3 performance from source code, 1997, PPOPP '97.

[19] Hiroshi Nakamura et al. Improving Cache Performance Through Tiling and Data Alignment, 1997, IRREGULAR.

[20] Michael Wolfe et al. High performance compilers for parallel computing, 1995.

[21] Mithuna Thottethodi et al. Tuning Strassen's Matrix Multiplication for Memory Efficiency, 1998, Proceedings of the IEEE/ACM SC98 Conference.

[22] Josef Stoer et al. Numerische Mathematik 1, 1989.

[23] Jack J. Dongarra et al. Automatically Tuned Linear Algebra Software, 1998, Proceedings of the IEEE/ACM SC98 Conference.

[24] H. T. Kung et al. I/O complexity: The red-blue pebble game, 1981, STOC '81.

[25] Steven S. Muchnick et al. Advanced Compiler Design and Implementation, 1997.

[26] Bo Kågström et al. GEMM-based level 3 BLAS: high-performance model implementations and performance evaluation benchmark, 1998, TOMS.

[27] Isak Jonsson et al. Recursive Blocked Data Formats and BLAS's for Dense Linear Algebra Algorithms, 1998, PARA.

[28] Bowen Alpern et al. A model for hierarchical memory, 1987, STOC.

[29] Mithuna Thottethodi et al. Recursive array layouts and fast parallel matrix multiplication, 1999, SPAA '99.

[30] Gianfranco Bilardi et al. A Characterization of Temporal Locality and Its Portability across Memory Hierarchies, 2001, ICALP.

[31] David S. Wise. Undulant-Block Elimination and Integer-Preserving Matrix Inversion, 1999, Sci. Comput. Program.

[32] Ken Kennedy et al. Compiler blockability of numerical algorithms, 1992, Proceedings Supercomputing '92.

[33] Sivan Toledo. Locality of Reference in LU Decomposition with Partial Pivoting, 1997, SIAM J. Matrix Anal. Appl.

[34] Michel J. Daydé et al. The RISC BLAS: a blocked implementation of level 3 BLAS for RISC processors, 1999, TOMS.

[35] Rudolf Eigenmann et al. Automatic program parallelization, 1993, Proc. IEEE.

[36] P. D'Alberto. Performance Evaluation of Data Locality Exploitation (Ph.D. Thesis), 2000.

[37] V. Strassen. Gaussian elimination is not optimal, 1969.

[38] Dhabaleswar K. Panda et al. Communication and Architectural Support for Network-Based Parallel Computing, 1997, Lecture Notes in Computer Science.