FusedCache: A Naturally Inclusive, Racetrack Memory, Dual-Level Private Cache

We propose FusedCache, a two-level set-associative Racetrack memory (RM) cache design that exploits RM's high density to provide fast uniform access at one level and non-uniform access at the next. FusedCache is well suited to private L1/L2 caches: it enforces alignment of L1 data with the RM access points, and the remaining non-aligned data serves as L2. It uses traditional LRU eviction for L1 misses. Promotion and demotion between L1 and L2 are performed through shifts and, when necessary, background swap operations. These swap operations do not require physical stores or loads, making accesses both faster and more energy efficient. Further, unlike a traditional inclusive cache hierarchy, fused L1 cache lines naturally exist in L2, avoiding duplicated storage and tag structures, promotions, and evictions. L1 status on each track is strictly enforced by track LRU maintenance and background swapping. Our results demonstrate that, compared to an iso-area L1 SRAM cache replacement, FusedCache improves application performance by 7 percent while reducing cache energy by 33 percent. Compared to an iso-capacity two-level (L1/L2) SRAM cache replacement, FusedCache provides similar performance with a 69 percent cache energy reduction. Compared to a TapeCache L1 scheme, FusedCache achieves a 7 percent performance improvement and a 6 percent cache energy saving, which translates to a 13 percent improvement in energy-delay product.
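The distinction between port-aligned (L1) and non-aligned (L2) data can be illustrated with a toy model. The following is a minimal sketch, not the FusedCache implementation: it assumes a single track with one access port, and the class name `ToyTrack`, its fields, and the shift-cost accounting are all hypothetical. It omits sets, tags, LRU maintenance, and the background swap operations described above. It shows only that a line aligned with the port is reached with no shifts, and that shifting to another line promotes it while implicitly demoting the line that was aligned before.

```python
from dataclasses import dataclass


@dataclass
class ToyTrack:
    """Hypothetical model of one racetrack with a single access port.

    Physically the domains shift past a fixed port; here we track the
    index of the line currently under the port, which is equivalent
    for counting shifts.
    """
    lines: list        # cache-line tags stored along the track, by position
    aligned: int = 0   # position currently aligned with the access port
    shifts: int = 0    # total single-domain shifts performed so far

    def access(self, tag):
        """Return 'L1' (aligned, no shift), 'L2' (shift needed), or 'miss'."""
        if tag not in self.lines:
            return "miss"              # fill/eviction policy is out of scope here
        pos = self.lines.index(tag)
        if pos == self.aligned:
            return "L1"                # uniform, shift-free access
        # Non-aligned data: pay a distance-dependent shift cost, after which
        # the requested line is aligned (promoted) and the old one is not.
        self.shifts += abs(pos - self.aligned)
        self.aligned = pos
        return "L2"


track = ToyTrack(lines=["A", "B", "C", "D"])
first = track.access("A")    # already under the port
second = track.access("C")   # two positions away from the port
third = track.access("C")    # now aligned after the previous shift
```

In this sketch the same physical line serves as either level depending only on its alignment, which is the sense in which no separate copy or tag entry is needed for the faster level.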
