A compiler-based approach for dynamically managing scratch-pad memories in embedded systems

Optimizations aimed at improving the efficiency of on-chip memories in embedded systems are extremely important. Using a suitable combination of program transformations and memory design space exploration aimed at enhancing data locality enables significant reductions in effective memory access latencies. While numerous compiler optimizations have been proposed to improve cache performance, there are relatively few techniques that focus on software-managed on-chip memories. It is well-known that software-managed memories are important in real-time embedded environments with hard deadlines as they allow one to accurately predict the amount of time a given code segment will take. In this paper, we propose and evaluate a compiler-controlled dynamic on-chip scratch-pad memory (SPM) management framework. Our framework includes an optimization suite that uses loop and data transformations, an on-chip memory partitioning step, and a code-rewriting phase that collectively transform an input code automatically to take advantage of the on-chip SPM. Compared with previous work, the proposed scheme is dynamic, and allows the contents of the SPM to change during the course of execution, depending on the changes in the data access pattern. Experimental results from our implementation using a source-to-source translator and a generic cost model indicate significant reductions in data transfer activity between the SPM and off-chip memory.

[1]  Jennifer Eyre,et al.  DSP Processors Hit the Mainstream , 1998, Computer.

[2]  Ken Kennedy,et al.  Improving cache performance in dynamic applications through data and computation reorganization at run time , 1999, PLDI '99.

[3]  John Zahorjan,et al.  Optimizing Data Locality by Array Restructuring , 1995 .

[4]  Donald Yeung,et al.  Evaluating the impact of memory system performance on software prefetching and locality optimizations , 2001, ICS '01.

[5]  W. Jalby,et al.  To copy or not to copy: a compile-time technique for assessing when data copying should be used to eliminate cache conflicts , 1993, Supercomputing '93.

[6]  Steven K. Reinhardt,et al.  A fully associative software-managed cache design , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[7]  Wei Li,et al.  Unifying data and control transformations for distributed shared-memory machines , 1995, PLDI '95.

[8]  François Irigoin,et al.  Supernode partitioning , 1988, POPL '88.

[9]  Alfred V. Aho,et al.  Compilers: Principles, Techniques, and Tools , 1986, Addison-Wesley series in computer science / World student series edition.

[10]  David A. Patterson,et al.  Computer Architecture: A Quantitative Approach , 1969 .

[11]  Michael F. P. O'Boyle,et al.  Integrating Loop and Data Transformations for Global Optimization , 2002, J. Parallel Distributed Comput..

[12]  H. De Man,et al.  Global communication and memory optimizing transformations for low power signal processing systems , 1994, Proceedings of 1994 IEEE Workshop on VLSI Signal Processing.

[13]  Mahmut T. Kandemir,et al.  Improving locality using loop and data transformations in an integrated framework , 1998, Proceedings. 31st Annual ACM/IEEE International Symposium on Microarchitecture.

[14]  Miodrag Potkonjak,et al.  MediaBench: a tool for evaluating and synthesizing multimedia and communications systems , 1997, Proceedings of 30th Annual International Symposium on Microarchitecture.

[15]  Sally A. McKee,et al.  Access ordering and memory-conscious cache utilization , 1995, Proceedings of 1995 1st IEEE Symposium on High Performance Computer Architecture.

[16]  Santosh Pande,et al.  Optimizing On-Chip Memory Usage Through Loop Restructuring for Embedded Processors , 2000 .

[17]  Michael F. P. O'Boyle,et al.  Integrating loop and data transformations for global optimisation , 1998, Proceedings. 1998 International Conference on Parallel Architectures and Compilation Techniques (Cat. No.98EX192).

[18]  David A. Patterson,et al.  Computer Architecture - A Quantitative Approach, 5th Edition , 1996 .

[19]  Chau-Wen Tseng,et al.  Improving Locality for Adaptive Irregular Scientific Codes , 2000, LCPC.

[20]  Kathryn S. McKinley,et al.  Tile size selection using cache organization and data layout , 1995, PLDI '95.

[21]  Monica S. Lam,et al.  The cache performance and optimizations of blocked algorithms , 1991, ASPLOS IV.

[22]  Chaitali Chakrabarti,et al.  Memory exploration for low power embedded systems , 1999, ISCAS'99. Proceedings of the 1999 IEEE International Symposium on Circuits and Systems VLSI (Cat. No.99CH36349).

[23]  Keshav Pingali,et al.  Data-centric multi-level blocking , 1997, PLDI '97.

[24]  Monica S. Lam,et al.  Automatic computation and data decomposition for multiprocessors , 1997 .

[25]  Sanjay Ranka,et al.  Memory hierarchy management for iterative graph structures , 1998, Proceedings of the First Merged International Parallel Processing Symposium and Symposium on Parallel and Distributed Processing.

[26]  Ken Kennedy,et al.  Improving memory hierarchy performance for irregular applications , 1999, ICS '99.

[27]  Francky Catthoor,et al.  Custom Memory Management Methodology: Exploration of Memory Organisation for Embedded Multimedia System Design , 1998 .

[28]  Mahmut T. Kandemir,et al.  Energy-driven integrated hardware-software optimizations using SimplePower , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[29]  Michael Wolfe,et al.  High performance compilers for parallel computing , 1995 .

[30]  Keith D. Cooper,et al.  Compiler-controlled memory , 1998, ASPLOS VIII.

[31]  Luca Benini,et al.  Increasing Energy Efficiency of Embedded Systems by Application-Specific Memory Hierarchy Generation , 2000, IEEE Des. Test Comput..

[32]  Saman Amarasinghe,et al.  The suif compiler for scalable parallel machines , 1995 .

[33]  Mahmut T. Kandemir,et al.  Influence of compiler optimizations on system power , 2000, Proceedings 37th Design Automation Conference.

[34]  Monica S. Lam,et al.  A data locality optimizing algorithm , 1991, PLDI '91.

[35]  Alfred V. Aho,et al.  Compilers: Principles, Techniques, and Tools (2nd Edition) , 2006 .

[36]  Anoop Gupta,et al.  Design and evaluation of a compiler algorithm for prefetching , 1992, ASPLOS V.

[37]  Chau-Wen Tseng,et al.  Improving data locality with loop transformations , 1996, TOPL.

[38]  J. Eyre,et al.  The evolution of DSP processors , 2000, IEEE Signal Process. Mag..

[39]  Nikil D. Dutt,et al.  Architectural exploration and optimization of local memory in embedded systems , 1997, Proceedings. Tenth International Symposium on System Synthesis (Cat. No.97TB100114).

[40]  Anant Agarwal,et al.  Automatic Partitioning of Parallel Loops and Data Arrays for Distributed Shared-Memory Multiprocessors , 1995, IEEE Trans. Parallel Distributed Syst..

[41]  Nikil D. Dutt,et al.  Efficient utilization of scratch-pad memory in embedded processor applications , 1997, Proceedings European Design and Test Conference. ED & TC 97.

[42]  Dennis Gannon,et al.  Strategies for cache and local memory management by global program transformation , 1988, J. Parallel Distributed Comput..

[43]  Larry Carter,et al.  Localizing non-affine array references , 1999, 1999 International Conference on Parallel Architectures and Compilation Techniques (Cat. No.PR00425).

[44]  Utpal Banerjee,et al.  Dependence analysis for supercomputing , 1988, The Kluwer international series in engineering and computer science.