A fine-grained link-level fault-tolerant mechanism for networks-on-chip

Silicon technology scaling continues to enable denser integration. However, this comes at the expense of higher variability and greater susceptibility to wear-out. With an escalating number of on-chip components expected to be defective in near-future chips, modern parallel systems, such as Chip Multi-Processors (CMPs), become especially vulnerable to these faults. Just a single link failure in the underlying Network-on-Chip (NoC) may cause inter-tile communication to halt or even deadlock, rendering the chip useless. While fault-tolerant routing schemes do exist, they can only handle a finite number of link faults. In this paper, we address permanent wire failures that can occur in on-chip parallel links at manufacture time or during operation. Instead of marking the entire link as faulty, we present a methodology in which a Partially Faulty Link (PFL) can still be used to transfer data between NoC routers, thus maintaining network connectivity, improving yield, extending the lifetime of the chip, and allowing for graceful performance degradation. To achieve this, we devise architectural augmentations to both the router and link micro-architectures, along with link fault detection, diagnosis, and re-configuration at wire granularity. Statistical link-level fault models demonstrate the usability of PFLs, while relevant load-balancing routing algorithms and low-cost re-transmission mechanisms are also presented and coupled to the proposed architecture. Hardware synthesis demonstrates the feasibility of the proposed extensions to the base NoC architecture. Results obtained from full-system simulations show that high-performance NoCs are realizable in the presence of PFLs.
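
To make the idea of a Partially Faulty Link concrete, the following is a minimal Python sketch of one way a flit could be carried over only the surviving wires of a parallel link, at the cost of extra cycles. It is an illustration under stated assumptions, not the mechanism described in the paper: the function names (`healthy_wires`, `transmit`, `receive`), the 8-bit link width, and the simple in-order bit-to-wire mapping are all hypothetical, and fault detection and diagnosis are assumed to have already produced the set of faulty wire indices.

```python
def healthy_wires(width, faulty):
    """Return the indices of wires that are still usable on the link."""
    return [w for w in range(width) if w not in faulty]


def transmit(flit, width, faulty):
    """Spread a flit (list of bits) over the healthy wires, one chunk per cycle.

    Each cycle is modelled as a dict mapping wire index -> bit driven on it.
    Faulty wires are never driven. (Hypothetical mapping, for illustration.)
    """
    good = healthy_wires(width, faulty)
    if not good:
        raise ValueError("link has no healthy wires left")
    cycles = []
    for start in range(0, len(flit), len(good)):
        chunk = flit[start:start + len(good)]
        cycles.append(dict(zip(good, chunk)))
    return cycles


def receive(cycles, width, faulty):
    """Reassemble the flit by reading the healthy wires in order, cycle by cycle."""
    good = healthy_wires(width, faulty)
    bits = []
    for cycle in cycles:
        bits.extend(cycle[w] for w in good if w in cycle)
    return bits


# Example: an 8-wire link with three permanently failed wires (assumed values).
flit = [1, 0, 1, 1, 0, 0, 1, 0]
faulty = {2, 5, 6}
cycles = transmit(flit, 8, faulty)
recovered = receive(cycles, 8, faulty)
```

In this sketch the link stays usable with five of eight wires, and the flit needs two cycles instead of one, which is the kind of graceful performance degradation the abstract refers to.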
