Optical overlay NUCA: A high speed substrate for shared L2 caches

In this paper, we propose using optical NOCs to design cache access protocols for large shared L2 caches. The problem is unique because optical networks have very low latency, so in principle all the cache banks are very close to each other. A naive approach is to broadcast each request to the set of banks that might contain a copy of the block. However, this approach wastes energy and bandwidth. We therefore propose a novel scheme, TSI, which creates a set of virtual networks (overlays) of cache banks over a physical optical NOC. We search for a block inside each overlay using a combination of multicast and unicast messages. To support the overlay networks, we propose optimizations to the previously proposed R-SWMR network. We also propose a set of novel hardware structures for creating and managing overlays, and for efficiently locating blocks within an overlay. The performance of the TSI scheme is within 2-3% of the broadcast scheme, and it is 50% faster than traditional static NUCA schemes. Compared to the broadcast scheme, it reduces the number of accesses, and consequently the dynamic energy, by 20-30%.
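The trade-off between broadcasting and searching within an overlay can be illustrated with a toy access-count model. This is only a sketch under assumptions, not the TSI protocol itself: the abstract does not specify how overlays are formed or searched, so the function names, the `multicast_width` parameter, and the "multicast to a few banks, then unicast to the rest one by one" order are all hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

Contents = Dict[int, Set[int]]  # bank id -> set of block addresses held


def broadcast_lookup(block: int, candidates: List[int],
                     contents: Contents) -> Tuple[Optional[int], int]:
    """Probe every candidate bank at once.

    Returns (bank holding the block or None, number of banks accessed).
    """
    hit = next((b for b in candidates if block in contents[b]), None)
    return hit, len(candidates)


def overlay_lookup(block: int, overlay: List[int], contents: Contents,
                   multicast_width: int = 2) -> Tuple[Optional[int], int]:
    """Hypothetical overlay search (assumed order, not from the paper).

    First multicast to the leading `multicast_width` banks of the overlay,
    then unicast to the remaining banks one at a time until a hit.
    """
    first = overlay[:multicast_width]
    accessed = len(first)  # a multicast accesses all of its targets
    for b in first:
        if block in contents[b]:
            return b, accessed
    for b in overlay[multicast_width:]:
        accessed += 1  # each unicast accesses exactly one more bank
        if block in contents[b]:
            return b, accessed
    return None, accessed


# Toy configuration: one overlay of 8 banks, block 0xABC resident in bank 3.
overlay = [0, 1, 2, 3, 4, 5, 6, 7]
contents: Contents = {b: set() for b in overlay}
contents[3].add(0xABC)

print(broadcast_lookup(0xABC, overlay, contents))  # (3, 8)
print(overlay_lookup(0xABC, overlay, contents))    # (3, 4)
```

In this toy case the overlay search touches 4 banks instead of 8, at the cost of extra sequential messages. On a low-latency optical network those extra messages are comparatively cheap, which is the intuition behind staying close to broadcast performance while accessing fewer banks.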

Published as: Smruti R. Sarangi et al., "Optical overlay NUCA: A high speed substrate for shared L2 caches," HiPC, 2014.