Cache-oblivious algorithms

This article presents asymptotically optimal algorithms for rectangular matrix transpose, fast Fourier transform (FFT), and sorting on computers with multiple levels of caching. Unlike previous optimal algorithms, these algorithms are <i>cache oblivious</i>: no variables dependent on hardware parameters, such as cache size and cache-line length, need to be tuned to achieve optimality. Nevertheless, these algorithms use an optimal amount of work and move data optimally among multiple levels of cache. For a cache with size <i>M</i> and cache-line length <i>B</i> where <i>M</i> = <i>Ω</i>(<i>B</i><sup>2</sup>), the number of cache misses for an <i>m</i> × <i>n</i> matrix transpose is <i>Θ</i>(1 + <i>mn</i>/<i>B</i>). The number of cache misses for either an <i>n</i>-point FFT or the sorting of <i>n</i> numbers is <i>Θ</i>(1 + (<i>n</i>/<i>B</i>)(1 + log<sub><i>M</i></sub> <i>n</i>)). We also give a <i>Θ</i>(<i>mnp</i>)-work algorithm to multiply an <i>m</i> × <i>n</i> matrix by an <i>n</i> × <i>p</i> matrix that incurs <i>Θ</i>(1 + (<i>mn</i> + <i>np</i> + <i>mp</i>)/<i>B</i> + <i>mnp</i>/(<i>B</i>√<i>M</i>)) cache faults. We introduce an “ideal-cache” model to analyze our algorithms. We prove that an optimal cache-oblivious algorithm designed for two levels of memory is also optimal for multiple levels and that the assumption of optimal replacement in the ideal-cache model can be simulated efficiently by LRU replacement. We offer empirical evidence that cache-oblivious algorithms perform well in practice.
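To make the idea of a cache-oblivious matrix transpose concrete, here is a minimal sketch of one common divide-and-conquer formulation: split the longer dimension in half and recurse, so that sub-blocks eventually fit in every cache level without the code knowing <i>M</i> or <i>B</i>. The abstract does not spell out the algorithm, so this should be read as an illustration under assumptions rather than the authors' exact procedure. The function name `co_transpose`, the flat row-major list layout, and the `base` cutoff are all hypothetical choices made for this example; `base` only trims recursion overhead and is not a hardware-tuned parameter (setting it to 1 gives the fully recursive form).

```python
def co_transpose(src, dst, m, n, r0=0, r1=None, c0=0, c1=None, base=8):
    """Write the transpose of the m x n row-major matrix `src`
    into the n x m row-major matrix `dst` (both flat lists).

    Illustrative sketch: recursively halves the longer side of the
    current sub-block [r0, r1) x [c0, c1). No cache size or line
    length appears anywhere in the code.
    """
    if r1 is None:
        r1 = m
    if c1 is None:
        c1 = n
    rows, cols = r1 - r0, c1 - c0

    if rows <= base and cols <= base:
        # Small block: copy element by element.
        for i in range(r0, r1):
            for j in range(c0, c1):
                dst[j * m + i] = src[i * n + j]
    elif rows >= cols:
        # Split the row range in half.
        mid = r0 + rows // 2
        co_transpose(src, dst, m, n, r0, mid, c0, c1, base)
        co_transpose(src, dst, m, n, mid, r1, c0, c1, base)
    else:
        # Split the column range in half.
        mid = c0 + cols // 2
        co_transpose(src, dst, m, n, r0, r1, c0, mid, base)
        co_transpose(src, dst, m, n, r0, r1, mid, c1, base)


# Usage: transpose a 5 x 7 matrix stored as a flat list.
m, n = 5, 7
a = list(range(m * n))
b = [None] * (m * n)
co_transpose(a, b, m, n, base=2)
```

In pure Python the interpreter overhead hides any cache effect; the sketch is meant to show the access pattern, which a compiled language would need in order to exhibit the miss bound quoted above.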
