Transient and Permanent Error Co-management Method for Reliable Networks-on-Chip

We propose a transient and permanent error co-management method for NoC links that achieves low latency, high throughput, and high reliability while maintaining energy efficiency. To reduce energy overhead, a configurable error control coding scheme adapts the number of redundant wires to the varying noise conditions, providing different levels of error detection capability. Infrequently used redundant wires serve as spare wires to replace broken links. Furthermore, a packet rebuilding/restoring algorithm that cooperates with a shortened error control coding method is proposed to support low-latency splitting transmission. With this co-management method, we handle transient errors and a small number of permanent errors without using extra spare wires, which reduces the need for adaptive routing. Simulation results show that the proposed method achieves up to 71% packet latency reduction and 20% throughput improvement compared to previous methods. Case studies show that our method reduces the energy per packet by up to 68% and 48% for low and high permanent error conditions, respectively.
