FOS: a low-power cache organization for multicores

The cache hierarchy of current multicore processors typically consists of one or two levels of private caches per core and a large shared last-level cache. This approach wastes area and energy because the private cache space is oversized, data are replicated across the inclusive cache levels, and the caches are highly set-associative. In this paper, we argue that, although commonly adopted, this approach has important design issues that a more energy-efficient organization can address. This work proposes Flat On-chip Storage (FOS), a novel cache organization that targets energy and area in low-power processors and resolves these issues. To this end, FOS merges the L2 and L3 cache levels into a single level, organized as a flat space and composed of a pool of small private cache slices. These slices are initially powered off to save energy; a slice is powered on and assigned to a core only when doing so is expected to improve system performance. Providing fast and uniform access from the private L1 caches to the FOS cache slices requires overcoming multiple architectural challenges, which entails the design of a custom optical network-on-chip. Experimental results show that FOS achieves significant savings in both static and dynamic energy over conventional cache organizations with the same storage capacity. FOS static energy savings are as much as 60% over an electrically connected shared cache, and they grow to as much as 75% over optically connected baselines. Moreover, despite deactivating part of the cache space, FOS achieves performance similar to that of conventional approaches.
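The slice activation idea described above (slices start powered off and are handed to a core only when a performance gain is expected) can be illustrated with a minimal sketch. This is not the allocation mechanism used by FOS, which the abstract does not detail. It is a simple greedy policy written under assumptions: the names `SlicePool`, `allocate_slices`, `predicted_misses` and `min_gain` are hypothetical, and the per-core miss predictions are made-up inputs rather than measured data.

```python
from dataclasses import dataclass, field


@dataclass
class SlicePool:
    """Pool of cache slices; a slice is powered on only while it has an owner."""
    total_slices: int
    owner: dict = field(default_factory=dict)  # slice id -> core id

    def powered_on(self) -> int:
        return len(self.owner)


def allocate_slices(pool: SlicePool, predicted_misses: dict, min_gain: int) -> dict:
    """Greedy sketch (an assumption, not the FOS policy).

    predicted_misses[core][k] is a hypothetical estimate of the misses that
    `core` would suffer when it owns k slices. A slice is powered on for the
    core with the largest expected miss reduction, and only if that reduction
    exceeds min_gain. Otherwise the remaining slices stay powered off.
    Returns the number of slices held by each core.
    """
    held = {core: 0 for core in predicted_misses}
    free = [s for s in range(pool.total_slices) if s not in pool.owner]
    while free:
        best_core, best_gain = None, min_gain
        for core, curve in predicted_misses.items():
            k = held[core]
            if k + 1 < len(curve):
                gain = curve[k] - curve[k + 1]  # misses avoided by one more slice
                if gain > best_gain:
                    best_core, best_gain = core, gain
        if best_core is None:
            break  # no core is expected to benefit enough
        pool.owner[free.pop(0)] = best_core
        held[best_core] += 1
    return held


# Hypothetical two-core example with an eight-slice pool.
pool = SlicePool(total_slices=8)
curves = {0: [100, 40, 35, 34], 1: [50, 45, 44, 44]}
held = allocate_slices(pool, curves, min_gain=4)
print(held, pool.powered_on(), "of", pool.total_slices, "slices powered on")
```

In this example most of the pool stays powered off because the predicted benefit of further slices falls below the threshold, which is the source of static energy savings in a scheme of this kind. A real design would also need to reclaim slices and to estimate the miss curves at run time, both of which are outside this sketch.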
