Spatial Locality Speculation to Reduce Energy in Chip-Multiprocessor Networks-on-Chip

As processor chips become increasingly parallel, an efficient communication substrate is critical for meeting performance and energy targets. In this work, we target the root cause of network energy consumption through techniques that reduce link and router-level switching activity. We specifically focus on memory subsystem traffic, as it comprises the bulk of NoC load in a CMP. By transmitting only the flits that contain words predicted useful using a novel spatial locality predictor, our scheme seeks to reduce network activity. We aim to further lower NoC energy through microarchitectural mechanisms that inhibit datapath switching activity for unused words in individual flits. Using simulation-based performance studies and detailed energy models based on synthesized router designs and different link wire types, we show that 1) the prediction mechanism achieves very high accuracy, with an average rate of false-unused prediction of just 2.5 percent; 2) the combined NoC energy savings enabled by the predictor and microarchitectural support is 36 percent, on average, and up to 57 percent in the best case; and 3) there is no system performance penalty as a result of this technique.

[1]  Stephen W. Keckler,et al.  Segment gating for static energy reduction in networks-on-chip , 2009, 2009 2nd International Workshop on Network on Chip Architectures.

[2]  Chau-Wen Tseng,et al.  Compiler optimizations for improving data locality , 1994, ASPLOS VI.

[3]  Kai Li,et al.  The PARSEC benchmark suite: Characterization and architectural implications , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[4]  Anant Agarwal,et al.  Scalar operand networks: on-chip interconnect for ILP in partitioned architectures , 2003, The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings..

[5]  Doug Burger,et al.  Implementation and Evaluation of On-Chip Network Architectures , 2006, 2006 International Conference on Computer Design.

[6]  Doe Hyun Yoon,et al.  Adaptive granularity memory systems: A tradeoff between storage efficiency and throughput , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).

[7]  Ronald G. Dreslinski,et al.  The M5 Simulator: Modeling Networked Systems , 2006, IEEE Micro.

[8]  Daniel A. Jiménez,et al.  Reducing network-on-chip energy consumption through spatial locality speculation , 2011, Proceedings of the Fifth ACM/IEEE International Symposium.

[9]  Gu-Yeon Wei,et al.  Process Variation Tolerant 3T1D-Based Cache Architectures , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[10]  Yale N. Patt,et al.  Line Distillation: Increasing Cache Capacity by Filtering Unused Words in Cache Lines , 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.

[11]  David A. Patterson,et al.  Computer Architecture: A Quantitative Approach , 1969 .

[12]  Walid A. Najjar,et al.  An evaluation of bottom-up and top-down thread generation techniques , 1993, MICRO 1993.

[13]  Arnab Banerjee,et al.  A Power and Energy Exploration of Network-on-Chip Architectures , 2007, First International Symposium on Networks-on-Chip (NOCS'07).

[14]  John S. Liptay,et al.  Structural Aspects of the System/360 Model 85 II: The Cache , 1968, IBM Syst. J..

[15]  W. Dally,et al.  Route packets, not wires: on-chip interconnection networks , 2001, Proceedings of the 38th Design Automation Conference (IEEE Cat. No.01CH37232).

[16]  James R. Larus,et al.  Making Pointer-Based Data Structures Cache Conscious , 2000, Computer.

[17]  Li Shang,et al.  Dynamic voltage scaling with links for power optimization of interconnection networks , 2003, The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings..

[18]  Doug Burger,et al.  An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches , 2002, ASPLOS X.

[19]  George Varghese,et al.  Low-swing on-chip signaling techniques: effectiveness and robustness , 2000, IEEE Trans. Very Large Scale Integr. Syst..

[20]  Matthias S. Müller,et al.  Memory Performance and Cache Coherency Effects on an Intel Nehalem Multiprocessor System , 2009, 2009 18th International Conference on Parallel Architectures and Compilation Techniques.

[21]  Babak Falsafi,et al.  Accurate and complexity-effective spatial pattern prediction , 2004, 10th International Symposium on High Performance Computer Architecture (HPCA'04).

[22]  Ki Hwan Yum,et al.  Adaptive data compression for high-performance low-power on-chip networks , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[23]  Eisse Mensink,et al.  Low-Power, High-Speed Transceivers for Network-on-Chip Communication , 2009, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.

[24]  Chandra Krintz,et al.  Cache-conscious data placement , 1998, ASPLOS VIII.

[25]  Karthikeyan Sankaralingam,et al.  Implementation and Evaluation of a Dynamically Routed Processor Operand Network , 2007, First International Symposium on Networks-on-Chip (NOCS'07).

[26]  Sriram R. Vangal,et al.  A 5-GHz Mesh Interconnect for a Teraflops Processor , 2007, IEEE Micro.

[27]  Andrew B. Kahng,et al.  ORION 2.0: A fast and accurate NoC power and area model for early-stage design space exploration , 2009, 2009 Design, Automation & Test in Europe Conference & Exhibition.

[28]  James R. Larus,et al.  Cache-conscious structure definition , 1999, PLDI '99.

[29]  Chita R. Das,et al.  Design and evaluation of a hierarchical on-chip interconnect for next-generation CMPs , 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.

[30]  Norman P. Jouppi,et al.  Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0 , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[31]  Babak Falsafi,et al.  Dead-block prediction & dead-block correlating prefetchers , 2001, ISCA 2001.

[32]  Eric Rotenberg,et al.  Assigning confidence to conditional branch predictions , 1996, Proceedings of the 29th Annual IEEE/ACM International Symposium on Microarchitecture. MICRO 29.

[33]  Chita R. Das,et al.  Performance and power optimization through data compression in Network-on-Chip architectures , 2008, 2008 IEEE 14th International Symposium on High Performance Computer Architecture.

[34]  Aneesh Aggarwal,et al.  Cache Noise Prediction , 2008, IEEE Transactions on Computers.