Reducing network-on-chip energy consumption through spatial locality speculation

As processor chips become increasingly parallel, an efficient communication substrate is critical for meeting performance and energy targets. In this work, we target the root cause of network energy consumption through techniques that reduce link and router-level switching activity. We focus specifically on memory subsystem traffic, as it comprises the bulk of the network-on-chip (NoC) load in a chip multiprocessor (CMP). Our scheme uses a novel spatial locality predictor to transmit only the flits that contain words predicted to be useful, thereby reducing network activity. We aim to lower NoC energy further through microarchitectural mechanisms that inhibit datapath switching activity for unused words in individual flits. Using simulation-based performance studies and detailed energy models based on synthesized router designs and different link wire types, we show that (a) the prediction mechanism achieves very high accuracy, with an average misprediction rate of just 2.5%; (b) the combined NoC energy savings enabled by the predictor and microarchitectural support are 35% on average and up to 60% in the best case; and (c) the performance impact of these energy optimizations is negligible.
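The flit-filtering idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the line size, word size, flit width, the bit-mask encoding of the prediction, and the function names `flits_to_send` and `gated_words` are all assumptions chosen for illustration, and header or control flits are not modeled.

```python
from typing import List

# Assumed geometry (not taken from the paper): a 64-byte line of
# sixteen 4-byte words, carried in 16-byte flits of four words each.
WORDS_PER_LINE = 16
WORDS_PER_FLIT = 4
FLITS_PER_LINE = WORDS_PER_LINE // WORDS_PER_FLIT


def flits_to_send(predicted_mask: int) -> List[int]:
    """Return indices of flits holding at least one predicted-useful word.

    Bit i of predicted_mask is 1 if word i is predicted to be used.
    """
    flit_bits = (1 << WORDS_PER_FLIT) - 1
    return [
        f
        for f in range(FLITS_PER_LINE)
        if (predicted_mask >> (f * WORDS_PER_FLIT)) & flit_bits
    ]


def gated_words(predicted_mask: int, flit: int) -> List[int]:
    """Return word indices inside a flit that are predicted unused.

    In this sketch these are the words whose datapath switching
    could be inhibited while the flit is still transmitted.
    """
    first = flit * WORDS_PER_FLIT
    return [
        w
        for w in range(first, first + WORDS_PER_FLIT)
        if not (predicted_mask >> w) & 1
    ]


# Hypothetical prediction: words 0, 4, and 5 are expected to be used.
mask = 0b0000_0000_0011_0001
print(flits_to_send(mask))    # flits that would be transmitted
print(gated_words(mask, 0))   # unused words within flit 0
```

Under these assumptions, only two of the four data flits would be sent for this prediction, and the unused words inside those two flits are candidates for the switching-inhibition mechanisms the abstract mentions.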
