Optical overlay NUCA: A high speed substrate for shared L2 caches

In this paper, we propose to use optical NOCs to design cache access protocols for large shared L2 caches. We observe that the problem is unique because optical networks have very low latency, and in principle all the cache banks are very close to each other. A naive approach is to broadcast a request to a set of banks that might possibly contain the copy of a block. However, this approach is wasteful in terms of energy and bandwidth. Hence, we propose a novel scheme in this paper, TSI, which proposes to create a set of virtual networks (overlays) of cache banks over a physical optical NOC. We search for a block inside each overlay using a combination of multicast and unicast messages. We additionally create support for our overlay networks by proposing optimizations to the previously proposed R-SWMR network. We also propose a set of novel hardware structures for creating and managing overlays, and for efficiently locating blocks in the overlay. The performance of the TSI scheme is within 2-3% of a broadcast scheme, and it is faster than traditional static NUCA schemes by 50%. As compared to the broadcast scheme it reduces the number of accesses, and consequently the dynamic energy by 20-30%.

[1]  Karthik Ramani,et al.  Interconnect-Aware Coherence Protocols for Chip Multiprocessors , 2006, 33rd International Symposium on Computer Architecture (ISCA'06).

[2]  N. Binkert,et al.  Atomic Coherence: Leveraging nanophotonics to build race-free cache coherence protocols , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.

[3]  George Kurian,et al.  ATAC: A 1000-core cache-coherent processor with on-chip optical network , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[4]  Graham T. Reed,et al.  Silicon Photonics: The State of the Art , 2008 .

[5]  Simon W. Moore,et al.  Low-latency virtual-channel routers for on-chip networks , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[6]  Ian O'Connor,et al.  Optical Ring Network-on-Chip (ORNoC): Architecture and design methodology , 2011, 2011 Design, Automation & Test in Europe.

[7]  Jiang Jiang,et al.  PSA-NUCA: A Pressure Self-Adapting Dynamic Non-uniform Cache Architecture , 2012, 2012 IEEE Seventh International Conference on Networking, Architecture, and Storage.

[8]  Valentin Puente,et al.  ESP-NUCA: A low-cost adaptive Non-Uniform Cache Architecture , 2010, HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture.

[9]  Yu Zhang,et al.  Firefly: illuminating future network-on-chip with nanophotonics , 2009, ISCA '09.

[10]  David A. Wood,et al.  ASR: Adaptive Selective Replication for CMP Caches , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[11]  Jung Ho Ahn,et al.  Corona: System Implications of Emerging Nanophotonic Technology , 2008, 2008 International Symposium on Computer Architecture.

[12]  Avinash Karanth Kodi,et al.  Extending the Performance and Energy-Efficiency of Shared Memory Multicores with Nanophotonic Technology , 2014, IEEE Transactions on Parallel and Distributed Systems.

[13]  Changkyu Kim,et al.  Nonuniform Cache Architectures for Wire-Delay Dominated On-Chip Caches , 2003, IEEE Micro.

[14]  Smruti R. Sarangi,et al.  ParTejas: A parallel simulator for multicore processors , 2014, ISPASS.

[15]  John Kim,et al.  FlexiShare: Channel sharing for an energy-efficient nanophotonic crossbar , 2010, HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture.

[16]  Wen-mei W. Hwu,et al.  Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing , 2012 .

[17]  Mikko H. Lipasti,et al.  Wavelength stealing: An opportunistic approach to channel sharing in multi-chip photonic interconnects , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[18]  José F. Martínez,et al.  A power-efficient all-optical on-chip interconnect using wavelength-based oblivious routing , 2010, ASPLOS XV.

[19]  Jun Yang,et al.  A composite and scalable cache coherence protocol for large scale CMPs , 2011, ICS '11.

[20]  Doug Burger,et al.  An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches , 2002, ASPLOS X.

[21]  Xi Chen,et al.  HERMES: A Hierarchical Broadcast-Based Silicon Photonic Interconnect for Scalable Many-Core Systems , 2014, ArXiv.

[22]  Hugo Thienpont,et al.  Architectural study of the opportunities for reconfigurable optical interconnects in distributed shared memory systems , 2004 .

[23]  Nevin Kirman,et al.  A power-efficient all-optical on-chip interconnect using wavelength-based oblivious routing , 2010, ASPLOS 2010.

[24]  Alyssa B. Apsel,et al.  Leveraging Optical Technology in Future Bus-based Chip Multiprocessors , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[25]  David H. Albonesi,et al.  Phastlane: a rapid transit optical routing network , 2009, ISCA '09.

[26]  Anoop Gupta,et al.  The SPLASH-2 programs: characterization and methodological considerations , 1995, ISCA.

[27]  Kai Li,et al.  The PARSEC benchmark suite: Characterization and architectural implications , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[28]  Shaahin Hessabi,et al.  All-Optical Wavelength-Routed Architecture for a Power-Efficient Network on Chip , 2014, IEEE Transactions on Computers.

[29]  Babak Falsafi,et al.  Reactive NUCA: near-optimal block placement and replication in distributed caches , 2009, ISCA '09.