A universal ordered NoC design platform for shared-memory MPSoC

Shared memory is the predominant programming model in today's MPSoCs. However, existing SoC on-chip communication standards such as AMBA rely on the interconnect for ordering. This becomes a problem as the number of actors increases: traditional simple interconnects such as buses and crossbars do not scale, yet scalable distributed NoCs are inherently unordered. Without built-in ordering capability in the NoC, cache coherence protocols have to rely on external ordering points, which forward the requests so that every cache observes them in the same order. Such ordering points, however, incur significant scalability issues, such as indirection latency and communication hotspots in the network. In this paper, we propose a universal ordered NoC platform for shared-memory MPSoC designs that provides coherence request ordering in addition to communication. The proposed solution is based on a separate lightweight ordering network that establishes the global request order, which the receiving NIC leverages when delivering requests. It provides comprehensive support for general network topologies and various levels of memory consistency, while adhering to existing cache coherence protocol standards. Full-system simulation with the heterogeneous MPSoC Rodinia benchmarks shows that it reduces request latency by 37.6% and 35.7% over ordering points in 2D-mesh and butterfly fat tree topologies, respectively. This translates to overall runtime improvements of 17.8% on a 36-node 2D-mesh MPSoC and 12.0% on a 32-node butterfly fat tree MPSoC.
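The role of the receiving NIC can be illustrated with a minimal sketch. It is not the implementation evaluated in the paper: the class `ReceivingNic`, its methods, and the request identifiers are all hypothetical, and the sketch simply assumes that the ordering network hands every NIC the same sequence of request identifiers while the requests themselves arrive over an unordered main network.

```python
from collections import deque


class ReceivingNic:
    """Hypothetical receiving NIC that delivers requests in global order."""

    def __init__(self):
        self.order = deque()   # request ids in global order (assumed to come from the ordering network)
        self.arrived = {}      # request id -> payload, as received from the unordered main network
        self.delivered = []    # payloads handed to the cache controller, in delivery order

    def on_order(self, req_id):
        """Record the next request id in the global order."""
        self.order.append(req_id)
        self._drain()

    def on_request(self, req_id, payload):
        """Buffer a request that arrived over the main network."""
        self.arrived[req_id] = payload
        self._drain()

    def _drain(self):
        # Deliver only while the next globally ordered request is already buffered.
        while self.order and self.order[0] in self.arrived:
            req_id = self.order.popleft()
            self.delivered.append(self.arrived.pop(req_id))


# Two NICs see the same global order but different arrival orders.
nic_a, nic_b = ReceivingNic(), ReceivingNic()
for nic in (nic_a, nic_b):
    for req_id in ("r0", "r1", "r2"):
        nic.on_order(req_id)

requests = {"r0": "GetS A", "r1": "GetM B", "r2": "GetS C"}
for req_id in ("r2", "r0", "r1"):
    nic_a.on_request(req_id, requests[req_id])
for req_id in ("r1", "r2", "r0"):
    nic_b.on_request(req_id, requests[req_id])

print(nic_a.delivered == nic_b.delivered)
```

In this sketch a request that arrives early stays buffered until every request ahead of it in the global order has been delivered, so both NICs hand requests to their caches in the same order without an external ordering point.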
