FP-NUCA: A Fast NOC Layer for Implementing Large NUCA Caches

NUCA (non-uniform cache architecture) caches have traditionally been proposed to mitigate wire delays and the delays introduced by complex networks on chip (NOCs). Prior approaches report significant performance gains from intelligent block placement, location, replication, and migration schemes. In this paper, we propose a different approach, FP-NUCA, which co-designs the last-level cache and the network on chip. We artificially constrain the communication pattern in the NUCA cache so that, for each set of banks, all messages travel along a few predefined paths (fast paths). We exploit this pattern with a new type of NOC router, the Freeze router, which augments a regular router with a layer of circuitry that gates the regular router's clock whenever a fast-path message is waiting to be transmitted. Messages on a fast path require no buffering, switching, or routing. We also pair the NOC with a bank predictor to reduce the number of messages and the resulting energy consumption. Compared with state-of-the-art protocols, FP-NUCA achieves speedups of up to 31 percent (mean: 6.3 percent) and ED2 reductions of up to 46 percent (mean: 10.4 percent) on a suite of Splash and Parsec benchmarks. We implement the Freeze router in VHDL and show that the additional fast-path logic has minimal area and timing overheads.
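The freeze behavior described above can be illustrated with a toy cycle-level model. This is only a minimal sketch under stated assumptions, not the paper's VHDL design: the class `FreezeRouterModel`, its `accept` and `step` methods, and the one-message-per-cycle timing are all hypothetical names and simplifications introduced here. It shows one idea only: when a fast-path message is present, the regular router does no work that cycle and the message passes through without being buffered.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class FreezeRouterModel:
    """Toy model of a router with a fast-path bypass (hypothetical, illustrative only)."""

    buffer: deque = field(default_factory=deque)  # regular-path input buffer
    frozen_cycles: int = 0  # cycles in which the regular router was clock-gated

    def accept(self, msg):
        """Queue a regular-path message in the input buffer."""
        self.buffer.append(msg)

    def step(self, fast_path_msg=None):
        """Advance one cycle; return (path, message), or None if idle."""
        if fast_path_msg is not None:
            # Assumption: a waiting fast-path message gates the regular
            # router's clock, so buffered messages make no progress.
            self.frozen_cycles += 1
            # The fast-path message is forwarded without touching the buffer.
            return ("fast", fast_path_msg)
        if self.buffer:
            # Regular path: dequeue and forward one buffered message.
            return ("regular", self.buffer.popleft())
        return None


router = FreezeRouterModel()
router.accept("req-A")
router.accept("req-B")
trace = [
    router.step(),                      # regular router forwards req-A
    router.step(fast_path_msg="fp-1"),  # regular router frozen; fp-1 bypasses
    router.step(),                      # regular router resumes with req-B
    router.step(),                      # nothing left to send
]
print(trace, router.frozen_cycles)
```

In this sketch the regular-path message `req-B` is delayed by exactly one cycle while `fp-1` passes through, which is the trade-off the abstract implies: fast-path traffic gains latency at the cost of briefly stalling regular traffic.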
