From online fault detection to fault management in Network-on-Chips: A ground-up approach

Owing to the ongoing miniaturization of silicon technology beyond the sub-micron domain and the trend of integrating ever more components on a single chip, the Network-on-Chip (NoC) paradigm has emerged to address the scalability and performance shortcomings of bus-based interconnects. As feature sizes shrink, systems become much more susceptible to faults caused by wear-out and environmental effects. Increasing reliability therefore requires mechanisms embedded in the system that can detect and manage faults at run time. This paper proposes a ground-up approach, from fault detection to fault management, for NoC-based systems-on-chip. The approach combines local fault management, for fast reaction to faults, with a global fault management mechanism that triggers large-scale reconfiguration of the NoC. Strategies for fault detection, localization, classification, and propagation to a global fault management unit are described in detail, and methods for local fault management are elaborated.
