Cache oblivious storage and access heuristics for blocked matrix-matrix multiplication

The authors investigate the effects of storage and execution ordering in blocked matrix-matrix multiplication. They find that submatrices do not have to be stored contiguously in memory to achieve near-optimal performance. They also find that a good choice of execution order for the submatrix operations can yield a speedup of up to four times for small block sizes.
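
To make the two findings concrete, the following is a minimal sketch of a blocked multiplication in which every submatrix is a separately allocated object (so blocks are not contiguous with one another) and the order of the block products is supplied as a parameter. It is an illustration under stated assumptions, not the authors' implementation: the function names, the dictionary-of-blocks layout, and the use of a Morton (Z-order) traversal as the alternative ordering are all hypothetical choices made here, and the summary above does not say which orderings the authors actually compared. Pure Python will not show the cache effects being studied; the sketch only demonstrates that the result does not depend on the execution order, which is what leaves the order free to be tuned.

```python
from itertools import product


def split_blocks(a, bs):
    """Copy an n x n matrix (list of lists) into separately allocated bs x bs blocks.

    Assumes n is a multiple of bs. Each block is its own list of lists, so the
    blocks are not laid out contiguously with respect to one another.
    """
    nb = len(a) // bs
    return {
        (bi, bj): [row[bj * bs:(bj + 1) * bs] for row in a[bi * bs:(bi + 1) * bs]]
        for bi, bj in product(range(nb), repeat=2)
    }


def block_multiply_add(c, a, b):
    """Accumulate the product of two small blocks: c += a @ b."""
    bs = len(a)
    for i in range(bs):
        for k in range(bs):
            aik = a[i][k]
            for j in range(bs):
                c[i][j] += aik * b[k][j]


def row_major_order(nb):
    """Plain triple loop over block indices (i, j, k)."""
    return list(product(range(nb), repeat=3))


def interleave_bits(i, j, k, bits=10):
    """Morton key: interleave the bits of i, j and k (hypothetical helper)."""
    key = 0
    for bit in range(bits):
        key |= ((i >> bit) & 1) << (3 * bit + 2)
        key |= ((j >> bit) & 1) << (3 * bit + 1)
        key |= ((k >> bit) & 1) << (3 * bit)
    return key


def morton_order(nb):
    """Same set of block products, visited along a Z-order curve (assumed heuristic)."""
    return sorted(product(range(nb), repeat=3), key=lambda t: interleave_bits(*t))


def blocked_matmul(a_blocks, b_blocks, nb, bs, order):
    """Compute C = A @ B block by block, visiting block products in the given order."""
    c_blocks = {
        (i, j): [[0.0] * bs for _ in range(bs)]
        for i, j in product(range(nb), repeat=2)
    }
    for i, j, k in order:
        # C[i, j] += A[i, k] @ B[k, j]
        block_multiply_add(c_blocks[i, j], a_blocks[i, k], b_blocks[k, j])
    return c_blocks
```

Both orderings enumerate the same set of (i, j, k) block products, so passing `row_major_order(nb)` or `morton_order(nb)` to `blocked_matmul` produces the same product matrix; only the sequence in which blocks are touched differs. In a compiled implementation that sequence determines how often a block is still in cache when it is needed again.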