Integrated Coherence Prediction: Towards Efficient Cache Coherence on NoC-Based Multicore Architectures

Multicore architectures with Network-on-Chips (NoCs) have been widely recognized as the de facto design for the efficient utilization of the continuously increasing density of transistors on a chip. A key challenge in designing such an NoC-based multicore processor is maintaining cache coherence in an efficient manner. Directory-based protocols avoid the bandwidth overhead of snoop-based protocols, therefore scaling to a large number of cores. However, conventional directory structures add significant indirection delay to cache-to-cache accesses in larger multicore processor. In this article we propose a novel hardware coherence technique, called integrated coherence prediction (ICP). This approach adopts a prediction technique for managing shared data to reduce or eliminate the cache-to-cache delay in coherence accesses. ICP has two unique features that differ from previous coherence prediction techniques. First, ICP introduces a new integrated prediction scheme that combines two kinds of predictors: owner predictor, which predicts the data writers and avoids the indirection through directory, and data predictor, which predicts the access address and prefetches data from remote nodes directly. Second, ICP uses a request replication method to reduce the negative effect of wrong owner prediction operations, thus facilitating overall performance improvement. We present the design and implementation details of the ICP approach. Using detailed full-system simulations, we conclude that the ICP provides a cost-effective solution for designing high-performance multicore processors.

[1]  Josep Torrellas,et al.  Data forwarding in scalable shared-memory multiprocessors , 1995, ICS '95.

[2]  Andrew B. Kahng,et al.  ORION 2.0: A fast and accurate NoC power and area model for early-stage design space exploration , 2009, 2009 Design, Automation & Test in Europe Conference & Exhibition.

[3]  Anoop Gupta,et al.  The Stanford Dash multiprocessor , 1992, Computer.

[4]  Thomas F. Wenisch,et al.  Store-ordered streaming of shared memory , 2005, 14th International Conference on Parallel Architectures and Compilation Techniques (PACT'05).

[5]  Stefanos Kaxiras,et al.  Coherence communication prediction in shared-memory multiprocessors , 2000, Proceedings Sixth International Symposium on High-Performance Computer Architecture. HPCA-6 (Cat. No.PR00550).

[6]  Babak Falsafi,et al.  Memory sharing predictor: the key to a speculative coherent DSM , 1999, ISCA.

[7]  José González,et al.  Owner Prediction for Accelerating Cache-to-Cache Transfer Misses in a cc-NUMA Architecture , 2002, ACM/IEEE SC 2002 Conference (SC'02).

[8]  Tarek El-Ghazawi,et al.  An adaptive cache coherence protocol for chip multiprocessors , 2010, IFMT '10.

[9]  Sarita V. Adve,et al.  An evaluation of fine-grain producer-initiated communication in cache-coherent multiprocessors , 1997, Proceedings Third International Symposium on High-Performance Computer Architecture.

[10]  Stefanos Kaxiras,et al.  Improving CC-NUMA performance using Instruction-based Prediction , 1999, Proceedings Fifth International Symposium on High-Performance Computer Architecture.

[11]  Jaehyuk Huh,et al.  A NUCA Substrate for Flexible CMP Cache Sharing , 2007, IEEE Transactions on Parallel and Distributed Systems.

[12]  George Kurian,et al.  The locality-aware adaptive cache coherence protocol , 2013, ISCA.

[13]  Ronald G. Dreslinski,et al.  The M5 Simulator: Modeling Networked Systems , 2006, IEEE Micro.

[14]  José González,et al.  The use of prediction for accelerating upgrade misses in cc-NUMA multiprocessors , 2002, Proceedings.International Conference on Parallel Architectures and Compilation Techniques.

[15]  Sangyeun Cho,et al.  Predicting Coherence Communication by Tracking Synchronization Points at Run Time , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.

[16]  Nong Xiao,et al.  VBON: Toward efficient on-chip networks via hierarchical virtual bus , 2013, Microprocess. Microsystems.

[17]  Nong Xiao,et al.  An optimized multicore cache coherence design for exploiting communication locality , 2012, GLSVLSI '12.

[18]  Richard E. Kessler,et al.  Evaluating stream buffers as a secondary cache replacement , 1994, Proceedings of 21 International Symposium on Computer Architecture.

[19]  Norman P. Jouppi,et al.  Improving direct-mapped cache performance by the addition of a small fully-associative cache and pre , 1990, ISCA 1990.

[20]  Milo M. K. Martin,et al.  Using destination-set prediction to improve the latency/bandwidth tradeoff in shared-memory multiprocessors , 2003, ISCA '03.

[21]  Natalie D. Enright Jerger,et al.  Virtual tree coherence: Leveraging regions and in-network multicast trees for scalable cache coherence , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[22]  Amirali Baniasadi,et al.  A Power-Aware Prediction-Based Cache Coherence Protocol for Chip Multiprocessors , 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.

[23]  John B. Carter,et al.  An Adaptive Cache Coherence Protocol Optimized for Producer-Consumer Sharing , 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.

[24]  Mark D. Hill,et al.  Using prediction to accelerate coherence protocols , 1998, ISCA.

[25]  Michael C. Huang,et al.  Improving support for locality and fine-grain sharing in chip multiprocessors , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[26]  Mario Lodde,et al.  Built-in fast gather control network for efficient support of coherence protocols , 2013, IET Comput. Digit. Tech..

[27]  T. Lovett,et al.  STiNG: A CC-NUMA Computer System for the Commercial Marketplace , 1996, 23rd Annual International Symposium on Computer Architecture (ISCA'96).

[28]  D. Lenoski,et al.  The SGI Origin: A ccnuma Highly Scalable Server , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.

[29]  Laxmi N. Bhuyan,et al.  Switch cache: a framework for improving the remote memory access latency of CC-NUMA multiprocessors , 1999, Proceedings Fifth International Symposium on High-Performance Computer Architecture.

[30]  Alberto Ros,et al.  DiCo-CMP: Efficient cache coherency in tiled CMP architectures , 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.

[31]  Hai Zhou,et al.  Parallel CAD: Algorithm Design and Programming Special Section Call for Papers TODAES: ACM Transactions on Design Automation of Electronic Systems , 2010 .

[32]  Michael L. Scott,et al.  The effect of network total order, broadcast, and remote-write capability on network-based shared memory computing , 2000, Proceedings Sixth International Symposium on High-Performance Computer Architecture. HPCA-6 (Cat. No.PR00550).

[33]  Norman P. Jouppi,et al.  Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers , 1990, [1990] Proceedings. The 17th Annual International Symposium on Computer Architecture.

[34]  Jaehyuk Huh,et al.  A NUCA substrate for flexible CMP cache sharing , 2005, ICS.

[35]  Li-Shiuan Peh,et al.  In-network cache coherence , 2006, IEEE Comput. Archit. Lett..

[36]  Maged M. Michael,et al.  Design and performance of directory caches for scalable shared memory multiprocessors , 1999, Proceedings Fifth International Symposium on High-Performance Computer Architecture.

[37]  David L. Dill,et al.  The Murphi Verification System , 1996, CAV.

[38]  Henry Hoffmann,et al.  On-Chip Interconnection Architecture of the Tile Processor , 2007, IEEE Micro.

[39]  Anoop Gupta,et al.  The Stanford FLASH multiprocessor , 1994, ISCA '94.

[40]  Karthik Ramani,et al.  Interconnect-Aware Coherence Protocols for Chip Multiprocessors , 2006, 33rd International Symposium on Computer Architecture (ISCA'06).

[41]  Manoj Franklin,et al.  Perceptron Based Consumer Prediction in Shared-Memory Multiprocessors , 2006, 2006 International Conference on Computer Design.

[42]  Anoop Gupta,et al.  The SPLASH-2 programs: characterization and methodological considerations , 1995, ISCA.

[43]  Stefanos Kaxiras,et al.  SARC Coherence: Scaling Directory Cache Coherence in Performance and Power , 2010, IEEE Micro.