Software-assisted cache mechanisms for embedded systems

Embedded systems increasingly use on-chip caches as part of their memory systems. This thesis presents cache mechanisms that improve cache performance and data availability, which can lead to more predictable cache performance. The first mechanism is an intelligent cache replacement policy that uses information about dead data and about data that is very frequently used. A theoretical analysis shows that the number of misses under intelligent cache replacement is guaranteed to be no more than the number of misses under traditional LRU replacement. Hardware and software-assisted mechanisms that implement intelligent cache replacement are presented and evaluated. The second mechanism is cache partitioning, which exploits disjoint access sequences, that is, sequences that do not overlap in the memory space. A theoretical result is proven showing that transforming an access sequence into a concatenation of disjoint access sequences is guaranteed to improve the cache hit rate. Partitioning mechanisms inspired by the concept of disjoint sequences are designed and evaluated. A profile-based analysis, annotation, and simulation framework has been implemented to evaluate the cache mechanisms. The framework takes a compiled benchmark program and a set of program inputs, and evaluates various cache mechanisms to provide a range of possible performance improvement scenarios. The proposed mechanisms have been evaluated with this framework by measuring cache miss rates and Instructions Per Clock (IPC). The results suggest that the proposed mechanisms are promising for improving cache performance and predictability with a modest increase in silicon area. (Copies available exclusively from MIT Libraries, Rm. 14-0551, Cambridge, MA 02139-4307. Ph. 617-253-5668; Fax 617-253-1690.)
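
The two ideas can be illustrated with a minimal simulation sketch. It is not the thesis's implementation: it assumes a fully associative cache, and it derives the dead-data hints from an oracle that marks a block dead after its last use in the trace, standing in for the profile-based annotations. The names `simulate` and `use_kill_hints` are hypothetical.

```python
from collections import OrderedDict


def simulate(trace, capacity, use_kill_hints=False):
    """Count misses for a fully associative cache holding `capacity` blocks.

    With use_kill_hints=True, a block is flagged dead right after its last
    use in the trace (an oracle, assumed here for illustration). On a miss
    with a full cache, a dead block is evicted if one exists; otherwise the
    least recently used block is evicted, as in plain LRU.
    """
    last_use = {addr: i for i, addr in enumerate(trace)}
    cache = OrderedDict()  # addr -> dead flag; iteration order is LRU first
    misses = 0
    for i, addr in enumerate(trace):
        if addr in cache:
            cache.move_to_end(addr)  # hit: make this block most recently used
        else:
            misses += 1
            if len(cache) >= capacity:
                victim = None
                if use_kill_hints:
                    # Prefer the least recently used block that is dead.
                    victim = next((a for a, dead in cache.items() if dead), None)
                if victim is None:
                    victim = next(iter(cache))  # fall back to the LRU block
                del cache[victim]
            cache[addr] = False
        # Updating an existing key does not change its LRU position.
        cache[addr] = use_kill_hints and last_use[addr] == i
    return misses


# Dead-data hints: blocks "b", "c", and "d" are each used only once.
trace = ["a", "b", "c", "a", "d", "a"]
lru_misses = simulate(trace, capacity=2)
kill_misses = simulate(trace, capacity=2, use_kill_hints=True)
print(lru_misses, kill_misses)  # 5 4

# Disjoint sequences: two streams touching non-overlapping addresses.
interleaved = [1, 3, 2, 4] * 3
concatenated = [1, 2] * 3 + [3, 4] * 3
print(simulate(interleaved, capacity=2), simulate(concatenated, capacity=2))  # 12 4
```

On this small trace the hinted policy incurs one fewer miss than LRU, and running the two disjoint streams back to back instead of interleaved reduces LRU misses from 12 to 4. These are single examples under the stated assumptions, not a demonstration of the general guarantees proven in the thesis.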
