A Survey on Design Approaches to Circumvent Permanent Faults in Networks-on-Chip

Increasing fault rates in current and future technology nodes, coupled with on-chip component counts in the hundreds, call for robust and fault-tolerant Network-on-Chip (NoC) designs. Given the central role of NoCs in today’s many-core chips, permanent faults that impair their functionality may significantly affect the performance, energy consumption, and correct operation of the entire system. As a result, fault-tolerant NoC design has gained much attention in recent years. In this article, we review the vast research efforts regarding a NoC’s components, namely, topology, routing algorithm, and router microarchitecture, as well as system-level approaches combined with reconfiguration; discuss the proposed architectures; and identify outstanding research questions.
