Hardware-Oriented Implementation of Cache Oblivious Matrix Operations Based on Space-Filling Curves

We will present hardware-oriented implementations of blockrecursive approaches for matrix operations, esp. matrix multiplication and LU decomposition. An element order based on a recursively constructed Peano space-filling curve is used to store the matrix elements. This block-recursive numbering scheme is changed into a standard rowmajor order, as soon as the respective matrix subblocks fit into level-1 cache. For operations on these small blocks, we implemented hardwareoriented kernels optimised for Intel's Core architecture. The resulting matrix-multiplication and LU-decomposition codes compete well with optimised libraries such as Intel's MKL, ATLAS, or GotoBLAS, but have the advantage that only comparably small and well-defined kernel operations have to be optimised to achieve high performance.