Instruction cache locking using temporal reuse profile

The performance of most embedded systems is critically dependent on the average memory access latency. Improving the cache hit rate can have significant positive impact on the performance of an application. Modern embedded processors often feature cache locking mechanisms that allow memory blocks to be locked in the cache under software control. Cache locking was primarily designed to offer timing predictability for hard real-time applications. Hence, the compiler optimization techniques focus on employing cache locking to improve worst-case execution time. However, cache locking can be quite effective in improving the average-case execution time of general embedded applications as well. In this paper, we explore static instruction cache locking to improve average-case program performance. We introduce temporal reuse profile to accurately and efficiently model the cost and benefit of locking memory blocks in the cache. We propose an optimal algorithm and a heuristic approach that use the temporal reuse profile to determine the most beneficial memory blocks to be locked in the cache. Experimental results show that locking heuristic achieves close to optimal results and can improve the cache miss rate by up to 24% across a suite of real-world benchmarks. Moreover, our heuristic provides significant improvement compared to the state-of-the-art locking algorithm both in terms of performance and efficiency.

[1]  Jan Reineke,et al.  Timing predictability of cache replacement policies , 2007, Real-Time Systems.

[2]  Norman P. Jouppi,et al.  CACTI: an enhanced cache access and cycle time model , 1996, IEEE J. Solid State Circuits.

[3]  Henrik Theiling,et al.  Compile-time decided instruction cache locking using worst-case execution paths , 2007, 2007 5th IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS).

[4]  Peter Marwedel,et al.  WCET-aware static locking of instruction caches , 2012, CGO '12.

[5]  Minming Li,et al.  Instruction cache locking for multi-task real-time embedded systems , 2011, Real-Time Systems.

[6]  Peter Marwedel,et al.  Dynamic overlay of scratchpad memory for energy minimization , 2004, International Conference on Hardware/Software Codesign and System Synthesis, 2004. CODES + ISSS 2004..

[7]  Víctor Viñals,et al.  Improving the WCET computation in the presence of a lockable instruction cache in multitasking real-time systems , 2011, J. Syst. Archit..

[8]  Nikil D. Dutt,et al.  Automatic tuning of two-level caches to embedded applications , 2004, Proceedings Design, Automation and Test in Europe Conference and Exhibition.

[9]  Sharad Malik,et al.  Cache modeling for real-time software: beyond direct mapped instruction caches , 1996, 17th IEEE Real-Time Systems Symposium.

[10]  Erik Hagersten,et al.  StatCache: a probabilistic approach to efficient and accurate data locality analysis , 2004, IEEE International Symposium on - ISPASS Performance Analysis of Systems and Software, 2004.

[11]  Xianfeng Li,et al.  Chronos: A timing analyzer for embedded software , 2007, Sci. Comput. Program..

[12]  Björn Lisper,et al.  Data cache locking for higher program predictability , 2003, SIGMETRICS '03.

[13]  Frank Vahid,et al.  A highly configurable cache architecture for embedded systems , 2003, 30th Annual International Symposium on Computer Architecture, 2003. Proceedings..

[14]  Yun Liang,et al.  Timing analysis of concurrent programs running on shared cache multi-cores , 2009, 2009 30th IEEE Real-Time Systems Symposium.

[15]  Yutao Zhong,et al.  Predicting whole-program locality through reuse distance analysis , 2003, PLDI.

[16]  Tulika Mitra,et al.  Exploring locking & partitioning for predictable shared caches on multi-cores , 2008, 2008 45th ACM/IEEE Design Automation Conference.

[17]  Richard T. Witek,et al.  A 160 MHz 32 b 0.5 W CMOS RISC microprocessor , 1996, 1996 IEEE International Solid-State Circuits Conference. Digest of TEchnical Papers, ISSCC.

[18]  Trevor N. Mudge,et al.  Trace-driven memory simulation: a survey , 1997, CSUR.

[19]  Minming Li,et al.  Minimizing WCET for Real-Time Embedded Systems via Static Instruction Cache Locking , 2009, 2009 15th IEEE Real-Time and Embedded Technology and Applications Symposium.

[20]  Ieee Circuits,et al.  IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems information for authors , 2018, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.

[21]  Richard A. Uhlig,et al.  Trap-driven memory simulation , 1995 .

[22]  Michael D. Smith,et al.  Procedure placement using temporal-ordering information , 1999, TOPL.

[23]  Nikil D. Dutt,et al.  SPMVisor: Dynamic scratchpad memory virtualization for secure, low power, and high performance distributed on-chip memories , 2011, 2011 Proceedings of the Ninth IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS).

[24]  Margaret Martonosi,et al.  Wattch: a framework for architectural-level power analysis and optimizations , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[25]  Isabelle Puaut,et al.  Low-complexity algorithms for static cache locking in multitasking hard real-time systems , 2002, 23rd IEEE Real-Time Systems Symposium, 2002. RTSS 2002..

[26]  Guang R. Gao,et al.  Improving power efficiency with compiler-assisted cache replacement , 2005, J. Embed. Comput..

[27]  Kristof Beyls,et al.  Reuse Distance as a Metric for Cache Behavior. , 2001 .

[28]  Todd M. Austin,et al.  SimpleScalar: An Infrastructure for Computer System Modeling , 2002, Computer.

[29]  Minming Li,et al.  Instruction Cache Locking for Embedded Systems using Probability Profile , 2012, J. Signal Process. Syst..

[30]  Yun Liang,et al.  Improved procedure placement for set associative caches , 2010, CASES '10.

[31]  Rajeev Barua,et al.  Instruction cache locking inside a binary rewriter , 2009, CASES '09.

[32]  Henrik Theiling,et al.  Fast and Precise WCET Prediction by Separated Cache and Path Analyses , 2000, Real-Time Systems.

[33]  Jeffrey K. Hollingsworth,et al.  An API for Runtime Code Patching , 2000, Int. J. High Perform. Comput. Appl..

[34]  Michael F. P. O'Boyle,et al.  MiDataSets: Creating the Conditions for a More Realistic Evaluation of Iterative Optimization , 2007, HiPEAC.

[35]  Yun Liang,et al.  Static analysis for fast and accurate design space exploration of caches , 2008, CODES+ISSS '08.

[36]  Yun Liang,et al.  WCET-centric partial instruction cache locking , 2012, DAC Design Automation Conference 2012.