Deadlock Recovery in Asynchronous Networks on Chip in the Presence of Transient Faults

Asynchronous networks-on-chip (NoCs) have been proposed as a promising infrastructure for scalable and efficient on-chip communication in many-core systems. When implemented as quasi-delay-insensitive (QDI) circuits, asynchronous NoCs can achieve timing robustness. However, advancing semiconductor technology brings shrinking transistor dimensions and increasing chip density, which make faults, especially transient faults, more frequent. Transient faults in QDI circuits can cause not only data errors (symbol corruption and insertion) but also deadlock. When a deadlock occurs in an asynchronous NoC, it can spread across the whole network and paralyse it. This kind of deadlock has not been fully studied, and most traditional fault-tolerant techniques cannot deal with it. Using a new model built for QDI pipelines, this work systematically studies the formation and behaviour of deadlocks caused by transient faults. From the summarized deadlock patterns, the fault position can be precisely located and the fault type diagnosed. A fine-grained recovery mechanism is proposed to recover the network from different deadlocks. As a design case, an asynchronous NoC is presented that can recover from deadlocks caused by both transient and permanent faults on links. Detailed experimental results are given.
