Split array and scalar data caches: a comprehensive study of data cache organization

Existing cache organizations cannot distinguish between different types of locality: they cache all data non-selectively rather than attempting to exploit the locality type of each reference. This causes unnecessary movement of data among the levels of the memory hierarchy and increases the miss ratio. In this dissertation I propose a split data cache architecture that groups memory accesses into scalar or array references according to their inherent locality and maps each group to a dedicated cache partition. Because scalar and array references no longer evict each other's data in this system, cache interference is diminished and performance improves. Further improvement is achieved by introducing a victim cache, prefetching, data flattening, and reconfigurability to tune the array and scalar caches for a specific application. The most significant contribution of my work is a novel cache architecture for embedded microprocessor platforms. The proposed architecture couples reconfigurability with split data caches to reduce the area and power consumed by cache memories while retaining the performance gains. My results show substantial reductions in both memory size and memory access times, which translate into reduced power consumption. Because miss rates at the L-1 caches drop sharply, further power reduction is achieved by partially or completely shutting down the L-2 data or L-2 instruction caches. The cache area saved by these designs can be used for other processor features, including instruction and data prefetching and branch-prediction buffers. My work evaluates the potential benefits of these techniques for embedded applications. I also explore how my cache organization performs for non-numeric data structures.
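The interference effect described above can be illustrated with a toy simulation. The sketch below is not the simulator or configuration used in this dissertation. It assumes direct-mapped caches, a hypothetical 32-byte line size, and a contrived trace in which one scalar and a streamed array compete for the same cache line. All class and function names are illustrative.

```python
class DirectMappedCache:
    """Minimal direct-mapped cache model that only counts misses."""

    def __init__(self, num_lines, line_size):
        self.num_lines = num_lines
        self.line_size = line_size
        self.tags = [None] * num_lines  # block number held by each line
        self.misses = 0

    def access(self, address):
        block = address // self.line_size
        index = block % self.num_lines
        if self.tags[index] != block:
            self.tags[index] = block  # fill the line on a miss
            self.misses += 1


def run_unified(trace, num_lines, line_size):
    """Send every reference to one shared cache; return total misses."""
    cache = DirectMappedCache(num_lines, line_size)
    for _kind, address in trace:
        cache.access(address)
    return cache.misses


def run_split(trace, scalar_lines, array_lines, line_size):
    """Route scalar and array references to separate partitions."""
    scalar = DirectMappedCache(scalar_lines, line_size)
    array = DirectMappedCache(array_lines, line_size)
    for kind, address in trace:
        (array if kind == "array" else scalar).access(address)
    return scalar.misses + array.misses


# Contrived trace (an assumption): one scalar at address 0 is read before
# each element of a 64-element array of 4-byte values starting at 256.
trace = []
for i in range(64):
    trace.append(("scalar", 0))
    trace.append(("array", 256 + 4 * i))

unified_misses = run_unified(trace, num_lines=8, line_size=32)
split_misses = run_split(trace, scalar_lines=4, array_lines=4, line_size=32)
print(unified_misses, split_misses)  # prints "24 9"
```

With the same total capacity (8 lines), the unified cache takes 24 misses on this trace because the scalar and the first array block keep evicting each other, while the split organization takes 9. The numbers only demonstrate the mechanism on a hand-built trace and say nothing about real workloads.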
I propose a novel technique called "data flattening", a profile-based memory allocation technique that compresses sparsely scattered pointer data into regular, contiguous memory locations, and I explore the potential of my proposed split cache organization for data treated with the data flattening method.
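The idea behind data flattening can be sketched as follows. This abstract does not specify the dissertation's actual allocation algorithm, so the sketch assumes the simplest plausible policy: nodes are relocated to consecutive slots in the order a profile run first touches them. The node size, line size, and addresses are hypothetical.

```python
def flatten(profile, node_size, base=0):
    """Map each old node address to a new contiguous address,
    assigned in first-touch order from the profiled access trace."""
    new_address = {}
    for old in profile:
        if old not in new_address:
            new_address[old] = base + node_size * len(new_address)
    return new_address


def blocks_touched(addresses, line_size):
    """Count the distinct cache blocks covered by a set of addresses."""
    return len({a // line_size for a in addresses})


# Hypothetical heap addresses of eight 16-byte list nodes, in list order.
scattered = [4096, 64, 8192, 1024, 512, 2048, 16384, 256]
profile = scattered * 2  # the profile run walks the list twice

mapping = flatten(profile, node_size=16)
before = blocks_touched(scattered, line_size=32)
after = blocks_touched(mapping.values(), line_size=32)
print(before, after)  # prints "8 4"
```

After relocation, two 16-byte nodes share each 32-byte line, so a traversal touches 4 blocks instead of 8 and its access pattern resembles a sequential array scan, which is the kind of reference stream an array cache partition is meant to serve.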
