Extending the effectiveness of 3D-stacked DRAM caches with an adaptive multi-queue policy

3D-integration is a promising technology to help combat the ¿memory wall¿ in future multi-core processors. Past work has considered using 3D-stacked DRAM as a large last-level cache (LLC). While significant performance benefits can be gained with such an approach, there remain additional opportunities beyond the simple integration of commodity DRAM chips. In this work, we leverage the hardware organization typical of DRAM architectures to propose new cache management policies that would otherwise not be practical for standard SRAM-based caches. We propose a cache where each set is organized as multiple logical FIFO or queue structures that simultaneously provide performance isolation between threads as well as reduce the number of entries occupied by dead lines. Our results show that beyond the simplistic approach of stacking DRAM as cache, such tightly-integrated 3D architectures enable new opportunities for optimizing and improving system performance.

[1]  G. Edward Suh,et al.  Dynamic Partitioning of Shared Cache Memory , 2004, The Journal of Supercomputing.

[2]  Yuan Xie,et al.  Processor Design in 3D Die-Stacking Technologies , 2007, IEEE Micro.

[3]  Gabriel H. Loh,et al.  Zesto: A cycle-level simulator for highly detailed microarchitecture exploration , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.

[4]  C. Nicopoulos,et al.  Design and Management of 3D Chip Multiprocessors Using Network-in-Memory , 2006, ISCA 2006.

[5]  Rajeev Balasubramonian,et al.  Optimizing communication and capacity in a 3D stacked reconfigurable cache hierarchy , 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.

[6]  Krisztián Flautner,et al.  PicoServer: using 3D stacking technology to enable a compact energy efficient chip multiprocessor , 2006, ASPLOS XII.

[7]  Rajeev Balasubramonian,et al.  Leveraging 3D Technology for Improved Reliability , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[8]  Avi Mendelson,et al.  Trace Cache Sampling Filter , 2005, IEEE PACT.

[9]  Greg Hamerly,et al.  SimPoint 3.0: Faster and More Flexible Program Analysis , 2005 .

[10]  Lei Jiang,et al.  Die Stacking (3D) Microarchitecture , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[11]  Onur Mutlu,et al.  A Case for MLP-Aware Cache Replacement , 2006, 33rd International Symposium on Computer Architecture (ISCA'06).

[12]  Gabriel H. Loh,et al.  3D-Stacked Memory Architectures for Multi-core Processors , 2008, 2008 International Symposium on Computer Architecture.

[13]  Hsien-Hsin S. Lee,et al.  Smart Refresh: An Enhanced Memory Controller Design for Reducing Energy in Conventional and 3D Die-Stacked DRAMs , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[14]  Jack Doweck,et al.  Inside Intel® Core microarchitecture , 2006, 2006 IEEE Hot Chips 18 Symposium (HCS).

[15]  Yan Solihin,et al.  Fair cache sharing and partitioning in a chip multiprocessor architecture , 2004, Proceedings. 13th International Conference on Parallel Architecture and Compilation Techniques, 2004. PACT 2004..

[16]  John Turek,et al.  Optimal Partitioning of Cache Memory , 1992, IEEE Trans. Computers.

[17]  Steven K. Reinhardt,et al.  A fully associative software-managed cache design , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[18]  Yale N. Patt,et al.  Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[19]  Martin Burtscher,et al.  Bridging the processor-memory performance gap with 3D IC technology , 2005, IEEE Design & Test of Computers.

[20]  Jason Cong,et al.  An automated design flow for 3D microarchitecture evaluation , 2006, Asia and South Pacific Conference on Design Automation, 2006..

[21]  Kaustav Banerjee,et al.  A thermally-aware performance analysis of vertically integrated (3-D) processor-memory hierarchy , 2006, 2006 43rd ACM/IEEE Design Automation Conference.

[22]  Jun Yang,et al.  A low-radix and low-diameter 3D interconnection network design , 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.

[23]  P. Reed,et al.  Design aspects of a microprocessor data cache using 3D die interconnect technology , 2005, 2005 International Conference on Integrated Circuit Design and Technology, 2005. ICICDT 2005..

[24]  Jichuan Chang,et al.  Cooperative cache partitioning for chip multiprocessors , 2007, ICS '07.

[25]  Todd M. Austin,et al.  SimpleScalar: An Infrastructure for Computer System Modeling , 2002, Computer.

[26]  Narayanan Vijaykrishnan,et al.  Three-dimensional cache design exploration using 3DCacti , 2005, 2005 International Conference on Computer Design.

[27]  Aamer Jaleel,et al.  Adaptive insertion policies for managing shared caches , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[28]  Zhao Zhang,et al.  Gaining insights into multicore cache partitioning: Bridging the gap between simulation and real systems , 2008, 2008 IEEE 14th International Symposium on High Performance Computer Architecture.

[29]  S. McFarling Combining Branch Predictors , 1993 .

[30]  Won-Taek Lim,et al.  Architectural support for operating system-driven CMP cache management , 2006, 2006 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[31]  Chita R. Das,et al.  A novel dimensionally-decomposed router for on-chip communication in 3D architectures , 2007, ISCA '07.

[32]  James R. Goodman,et al.  The declining effectiveness of dynamic caching for general- purpose microprocessors , 1995 .

[33]  Gabriel H. Loh,et al.  Implementing caches in a 3D technology for high performance processors , 2005, 2005 International Conference on Computer Design.

[34]  Yan Solihin,et al.  Predicting inter-thread cache contention on a chip multi-processor architecture , 2005, 11th International Symposium on High-Performance Computer Architecture.

[35]  Srihari Makineni,et al.  Communist, Utilitarian, and Capitalist cache policies on CMPs: Caches as a shared resource , 2006, 2006 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[36]  Manoj Franklin,et al.  Balancing thoughput and fairness in SMT processors , 2001, 2001 IEEE International Symposium on Performance Analysis of Systems and Software. ISPASS..

[37]  Dean M. Tullsen,et al.  Symbiotic jobscheduling for a simultaneous mutlithreading processor , 2000, SIGP.

[38]  Brad Calder,et al.  SimPoint 3.0: Faster and More Flexible Program Phase Analysis , 2005, J. Instr. Level Parallelism.

[39]  Jaehyuk Huh,et al.  Cache bursts: A new approach for eliminating dead blocks and increasing cache efficiency , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[40]  Chita R. Das,et al.  MIRA: A Multi-layered On-Chip Interconnect Router Architecture , 2008, 2008 International Symposium on Computer Architecture.