Fault-Tolerant Flow Control in On-chip Networks

Interconnect scaling exacerbates the already challenging reliability problem of on-chip networks. Although many researchers have proposed fault-handling techniques for chip multi-processors (CMPs), fault tolerance in the interconnection network has yet to mature. Because an end-to-end recovery approach delays fault detection and complicates recovery to a consistent global state in such a system, we endorse link-level retransmission for recovery, which keeps the higher-level protocol simple. In this paper, we introduce a fault-tolerant flow control scheme for soft error handling in on-chip networks. The scheme recovers from errors at the link level by requesting retransmission, and it ensures error-free transmission on a per-flit basis by incorporating dynamic packet fragmentation. Dynamic packet fragmentation is adopted as part of the fault-tolerant flow control to release flits from fault containment and to recover the faulty flit transmission. The proposed router thus provides a high level of dependability at the link level for both the datapath and the control plane. In simulation with injected faults, the proposed router degrades gracefully while exhibiting 97% error coverage in datapath elements. The proposed router has been implemented using a TSMC 45 nm standard cell library. Compared with a router that employs triple modular redundancy (TMR) in datapath elements, the proposed router occupies 58% less area and consumes 40% less energy per packet on average.
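The link-level retransmission idea can be illustrated with a minimal Python sketch. It is not the router design described in this paper: the CRC-32 check, the per-flit sequence number, the retry bound, and every name in the code (`Flit`, `make_flit`, `flip_bit`, `send_over_link`) are assumptions made for illustration only. The sketch models a single link, injects single-bit faults deterministically, and leaves out dynamic packet fragmentation and control-plane protection.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Flit:
    seq: int        # hypothetical per-flit sequence number
    payload: bytes  # flit data as seen on the link
    crc: int        # check bits computed by the sender (assumed CRC-32)


def make_flit(seq: int, payload: bytes) -> Flit:
    """Sender side: attach a CRC so the receiver can detect corruption."""
    return Flit(seq, payload, zlib.crc32(payload))


def flip_bit(payload: bytes, bit: int) -> bytes:
    """Model a soft error by flipping one bit of a non-empty payload."""
    data = bytearray(payload)
    data[bit // 8] ^= 1 << (bit % 8)
    return bytes(data)


def send_over_link(flits, faults, max_retries=3):
    """Send flits over one link with per-flit retransmission.

    `faults` maps a sequence number to how many of that flit's first
    transmission attempts are corrupted (an illustrative fault model).
    Returns the delivered payloads and the total number of transmissions.
    """
    remaining = dict(faults)
    delivered = []
    attempts = 0
    for flit in flits:
        for _ in range(max_retries + 1):
            attempts += 1
            on_wire = flit
            if remaining.get(flit.seq, 0) > 0:
                remaining[flit.seq] -= 1
                on_wire = Flit(flit.seq, flip_bit(flit.payload, 0), flit.crc)
            # Receiver side: accept (ACK) only if the CRC matches.
            if zlib.crc32(on_wire.payload) == on_wire.crc:
                delivered.append(on_wire.payload)
                break
            # Otherwise the receiver NACKs and the sender retransmits
            # the copy it still holds for this flit.
        else:
            raise RuntimeError(f"flit {flit.seq} exceeded the retry bound")
    return delivered, attempts


payloads = [b"head", b"body", b"tail"]
flits = [make_flit(i, p) for i, p in enumerate(payloads)]
delivered, attempts = send_over_link(flits, faults={1: 2})
```

In this run the second flit is corrupted on its first two attempts, so it is sent three times while the other two flits are sent once each. Recovery stays local to the link and to the affected flit, which is the property that lets the higher-level protocol remain simple.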
