Toward Modeling Cache-Miss Ratio for Dense-Data-Access-Based Optimization

Adapting source code to the specifics of its host hardware is one way to implement software optimization. It lets software benefit from processor features that are primarily designed to improve system performance. Achieving such a software/hardware fit without narrowing the scope of the optimization to a few executions requires relevant performance models of the target hardware. This paper proposes a new method for optimizing software kernels by considering their data-access mode. The method builds a data-cache-miss model of a given application from its specific memory-access pattern. We apply the method to evaluate several custom implementations of matrix data layouts. To validate the functional correctness of the generated models, we propose a reference algorithm that simulates a kernel's exploration of its data. Experimental results show that the proposed data alignment reduces the number of cache misses by up to 50% and decreases the execution time by up to 30%. Finally, we show that the impact of the Translation Lookaside Buffers (TLB) and the memory prefetcher needs to be integrated into our performance models.
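To illustrate what simulating a kernel's exploration of its data can look like, the following minimal sketch replays a byte-address trace through a set-associative LRU cache and counts misses. It is not the reference algorithm of the paper: the cache geometry (32 KiB, 64-byte lines, 8 ways), the 8-byte element size, the 256 x 256 matrix, and the names `count_misses`, `row_major_trace`, and `column_walk_trace` are all assumptions chosen for illustration, and the TLB and the prefetcher are deliberately ignored.

```python
from collections import OrderedDict


def count_misses(addresses, cache_bytes=32 * 1024, line_bytes=64, ways=8):
    """Count misses of a set-associative LRU cache over a byte-address trace.

    The geometry defaults are illustrative assumptions, not measured values.
    """
    n_sets = cache_bytes // (line_bytes * ways)
    # One OrderedDict per set; insertion order tracks recency (oldest first).
    sets = [OrderedDict() for _ in range(n_sets)]
    misses = 0
    for addr in addresses:
        line = addr // line_bytes          # cache-line index of this access
        cache_set = sets[line % n_sets]    # set selected by the low line bits
        if line in cache_set:
            cache_set.move_to_end(line)    # hit: mark as most recently used
        else:
            misses += 1
            if len(cache_set) >= ways:
                cache_set.popitem(last=False)  # evict least recently used
            cache_set[line] = True
    return misses


def row_major_trace(n, elem_bytes=8):
    """Addresses touched when a row-major n x n matrix is walked row by row."""
    return [(i * n + j) * elem_bytes for i in range(n) for j in range(n)]


def column_walk_trace(n, elem_bytes=8):
    """Addresses touched when the same matrix is walked column by column."""
    return [(i * n + j) * elem_bytes for j in range(n) for i in range(n)]


n = 256  # hypothetical matrix dimension
row_misses = count_misses(row_major_trace(n))
col_misses = count_misses(column_walk_trace(n))
print("row-by-row misses:      ", row_misses)
print("column-by-column misses:", col_misses)
```

Under these assumptions the row-by-row walk misses once per cache line, whereas the column-by-column walk maps all rows onto only two sets and misses on every access. A gap of this kind between traversal orders is what a layout-aware miss model is meant to capture.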
