Data-layout optimization based on memory-access-pattern analysis for source-code performance improvement

With the rising impact of the memory wall, selecting an adequate data-structure implementation for a given kernel has become a performance-critical issue. This paper presents a new methodology that solves the data-layout decision problem by adapting an input implementation to the memory hierarchy of the host hardware. For a given input program, the proposed method analyzes the memory-access pattern of each selected variable and automatically identifies the best-performing data-layout implementation for it. The method is designed to be embedded within a general-purpose compiler. Experiments on the PolyBench/C benchmark suite and on recursive-bilateral-filter and JPEG-compression kernels show that our method accurately determines the optimized data-structure implementation. These optimized implementations achieve an execution-time speed-up of up to 48.9X and an L3-miss reduction of up to 98.1X on an x86 Intel Xeon processor with three levels of data cache using the least-recently-used (LRU) cache-replacement policy.
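The link between access pattern and data layout can be illustrated with a toy model. The sketch below is not the paper's method: it assumes a single, fully associative LRU cache (64 lines of 64 bytes), 8-byte elements, and a column-by-column walk over a square matrix, and it simply picks the layout with the fewest simulated misses. All names (`count_misses`, `column_walk`, `pick_layout`) are hypothetical and chosen for illustration only.

```python
from collections import OrderedDict


def count_misses(addresses, num_lines=64, line_size=64):
    """Count misses of a fully associative LRU cache for a byte-address trace."""
    cache = OrderedDict()  # keys are cache-line tags, ordered from oldest to newest
    misses = 0
    for addr in addresses:
        tag = addr // line_size
        if tag in cache:
            cache.move_to_end(tag)  # hit: mark the line as most recently used
        else:
            misses += 1
            cache[tag] = True
            if len(cache) > num_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return misses


def column_walk(n, layout, elem_size=8):
    """Yield byte addresses for a column-by-column walk of an n x n matrix."""
    for j in range(n):
        for i in range(n):
            if layout == "row_major":
                yield (i * n + j) * elem_size
            else:  # "col_major"
                yield (j * n + i) * elem_size


def pick_layout(n):
    """Return the layout with the fewest simulated misses, plus all miss counts."""
    misses = {
        layout: count_misses(column_walk(n, layout))
        for layout in ("row_major", "col_major")
    }
    return min(misses, key=misses.get), misses


if __name__ == "__main__":
    best, misses = pick_layout(128)
    print(best, misses)
```

Under these assumptions, a 128 x 128 matrix walked by columns misses on every access in the row-major layout, because each column touches 128 distinct lines and only 64 fit in the cache. The column-major layout misses once per cache line, so the toy selector chooses it. A real decision procedure would also have to account for set associativity, multiple cache levels, and the other accesses in the kernel.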
