Instruction-Cache Locking for Improving Embedded Systems Performance

Cache memories in embedded systems play an important role in reducing the execution time of applications. Various extensions have been added to cache hardware to involve software in replacement decisions, improving runtime over a purely hardware-managed cache. Novel embedded systems, such as Intel’s XScale and ARM Cortex processors, allow one or more cache lines to be locked; this feature is called cache locking. We present a method for instruction-cache locking that reduces the average-case runtime of a program. We demonstrate that the optimal solution for instruction-cache locking can be obtained in polynomial time. However, a fundamental lack of correlation between cache hardware and software program points renders such optimal solutions impractical. Instead, we propose two practical heuristics-based approaches to cache locking. First, we present a static mechanism, in which the locked contents of the cache are kept fixed over the execution of the program. Next, we present a dynamic mechanism that accounts for changing program requirements at runtime. We devise a cost-benefit model to discover the memory addresses that should be locked in the cache. We implement our scheme inside a binary rewriter, widening its applicability to binaries compiled using any compiler. Results obtained on a suite of MiBench benchmarks show that, compared to no cache locking, our static mechanism yields a 20% improvement in the instruction-cache miss rate on average and up to 18% improvement in execution time on average for applications having instruction accesses as a bottleneck. The dynamic mechanism improves the cache miss rate by 35% on average and execution time by 32% on instruction-cache-constrained applications.
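As a rough illustration of what a cost-benefit selection for static locking might look like, the following Python sketch greedily picks cache lines whose estimated benefit exceeds their estimated cost. It is not the paper's algorithm: the function name `select_lines_to_lock`, the profile format (line index mapped to a benefit and a cost, both in estimated misses), the modulo set mapping, and the rule of keeping at least one unlocked way per set are all assumptions made for this example.

```python
from collections import defaultdict

def select_lines_to_lock(profile, num_sets, ways):
    """Greedy sketch of static lock selection (hypothetical model).

    profile maps a cache-line index to (benefit, cost), where benefit is the
    estimated number of misses avoided by locking that line and cost is the
    estimated number of extra misses inflicted on other lines in its set.
    """
    limit = ways - 1                      # assumption: leave one way unlocked per set
    locked_per_set = defaultdict(int)
    chosen = []
    # Consider lines in decreasing order of net gain (benefit minus cost).
    ranked = sorted(profile.items(),
                    key=lambda kv: kv[1][0] - kv[1][1],
                    reverse=True)
    for line, (benefit, cost) in ranked:
        if benefit - cost <= 0:
            break                         # remaining lines do not pay off
        set_index = line % num_sets       # assumption: simple modulo set mapping
        if locked_per_set[set_index] < limit:
            locked_per_set[set_index] += 1
            chosen.append(line)
    return sorted(chosen)

# Hypothetical profile for a cache with 4 sets.
example_profile = {0: (100, 10), 4: (50, 5), 8: (30, 40), 1: (20, 0), 5: (10, 0)}
locked_2way = select_lines_to_lock(example_profile, num_sets=4, ways=2)
locked_3way = select_lines_to_lock(example_profile, num_sets=4, ways=3)
```

In this toy profile, line 8 is never locked because its estimated cost exceeds its benefit, and raising the associativity from 2 to 3 lets a second line per set be locked. A dynamic variant would rerun a selection like this at chosen program points with a profile for the upcoming region.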
