Multi-bit transient fault control for NoC links using 2D fault coding method

In deep nanometer scale, Network-on-Chip (NoC) links are more prone to multi-bit transient fault. Conventional ECC techniques brings heavy area, power, and timing overheads when correcting and detecting multiple transient faults. Therefore, a cost-effective ECC technique, named 2D fault coding method, is adopted to overcome the multi-bit transient fault issue of NoC links. Its key innovation is that the wires of a link are treated as its matrix appearance and light-weight Parity Check Coding (PCC) is performed on the matrix's two dimensions (horizontal matrix rows and vertical matrix columns). Horizontal PCCs and vertical PCCs work together to find the faults' position and then correct them by simply inverting them. The procedure of using the 2D fault coding method to protect a NoC link is proposed, its correction and detection capability is analyzed, and its hardware implementation is carried out. Comparative experiments show that the proposal can largely reduce the ECC hardware cost, have much higher fault detection coverage, maintain almost zero silent fault percentages, and have higher fault correction percentages normalized under the same area, demonstrating that it is cost-effective and suitable to the multi-bit transient fault control for NoC links.

[1]  Lawrence Clark,et al.  Delay and Area Efficient First-level Cache Soft Error Detection and Correction , 2006, 2006 International Conference on Computer Design.

[2]  T. Moon Error Correction Coding: Mathematical Methods and Algorithms , 2005 .

[3]  Luca Benini,et al.  Error control schemes for on-chip communication links: the energy-reliability tradeoff , 2005, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.

[4]  Axel Jantsch,et al.  Methods for fault tolerance in networks-on-chip , 2013, CSUR.

[5]  Babak Falsafi,et al.  Multi-bit Error Tolerant Caches Using Two-Dimensional Error Coding , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[6]  Santanu Chattopadhyay,et al.  A spare router based reliable Network-on-Chip design , 2014, 2014 IEEE International Symposium on Circuits and Systems (ISCAS).

[7]  Naresh R. Shanbhag,et al.  Coding for systern-on-chip networks: a unified framework , 2004, Proceedings. 41st Design Automation Conference, 2004..

[8]  Ram Huggahalli,et al.  Impact of Cache Coherence Protocols on the Processing of Network Traffic , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[9]  Nur A. Touba,et al.  Reliable Network-on-Chip Using a Low Cost Unequal Error Protection Code , 2007, 22nd IEEE International Symposium on Defect and Fault-Tolerance in VLSI Systems (DFT 2007).

[10]  Dhiraj K. Pradhan,et al.  Matrix Codes for Reliable and Cost Efficient Memory Chips , 2011, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.

[11]  Lei Xie,et al.  REPAIR: A Reliable Partial-Redundancy-Based Router in NoC , 2013, 2013 IEEE Eighth International Conference on Networking, Architecture and Storage.

[12]  Kunle Olukotun,et al.  Niagara: a 32-way multithreaded Sparc processor , 2005, IEEE Micro.

[13]  Dhiraj K. Pradhan,et al.  Matrix Codes: Multiple Bit Upsets Tolerant Method for SRAM Memories , 2007, 22nd IEEE International Symposium on Defect and Fault-Tolerance in VLSI Systems (DFT 2007).

[14]  Pasi Liljeberg,et al.  Analysis of forward error correction methods for nanoscale networks-on-chip , 2007, Nano-Net.

[15]  Paul Ampadu,et al.  Adaptive Error Control for NoC Switch-to-Switch Links in a Variable Noise Environment , 2008, 2008 IEEE International Symposium on Defect and Fault Tolerance of VLSI Systems.

[16]  Ahmed Louri,et al.  A Soft Error Tolerant Network-on-Chip Router Pipeline for Multi-Core Systems , 2015, IEEE Computer Architecture Letters.