Transient and Permanent Error Control for High-End Multiprocessor Systems-on-Chip

High-end MPSoC systems with built-in high-radix topologies achieve good performance because of the improved connectivity and the reduced network diameter. In high-end MPSoC systems, fault tolerance support is becoming a compulsory feature. In this work, we propose a combined method to address permanent and transient link and router failures in those systems. The LBDRhr mechanism is proposed to tolerate permanent link failures in some popular high-radix topologies. The increased router complexity may lead to more transient router errors than routers using simple XY routing algorithm. We exploit the inherent information redundancy (IIR) in LBDRhr logic to manage transient errors in the network routers. Thorough analyses are provided to discover the appropriate internal nodes and the forbidden signal patterns for transient error detection. Simulation results show that LBDRhr logic can tolerate all of the permanent failure combinations of long-range links and 80% of links failures at short-range links. Case studies show that the error detection method based on the new IIR extraction method reduces the power consumption and the residual error rate by 33% and up to two orders of magnitude, respectively, compared to triple modular redundancy. The impact of network topologies on the efficiency of the detection mechanism has been examined in this work, as well.

[1]  Davide Bertozzi,et al.  Designing Network On-Chip Architectures in the Nanoscale Era , 2010 .

[2]  S. Medardoni,et al.  Flexible DOR routing for virtualization of multicore chips , 2009, 2009 International Symposium on System-on-Chip.

[3]  William J. Dally,et al.  Research Challenges for On-Chip Interconnection Networks , 2007, IEEE Micro.

[4]  Paul Ampadu,et al.  Dual-Layer Adaptive Error Control for Network-on-Chip Links , 2012, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.

[5]  David Blaauw,et al.  Vicis: A reliable network for unreliable silicon , 2009, 2009 46th ACM/IEEE Design Automation Conference.

[6]  Marcello Coppola,et al.  Efficient routing implementation in complex systems-on-chip , 2011, Proceedings of the Fifth ACM/IEEE International Symposium.

[7]  B. L. Bhuva,et al.  Analysis of soft error rates in combinational and sequential logic and implications of hardening for advanced technologies , 2010, 2010 IEEE International Reliability Physics Symposium.

[8]  R.C. Baumann,et al.  Radiation-induced soft errors in advanced semiconductor technologies , 2005, IEEE Transactions on Device and Materials Reliability.

[9]  Paul Ampadu,et al.  Exploiting inherent information redundancy to manage transient errors in NoC routing arbitration , 2011, Proceedings of the Fifth ACM/IEEE International Symposium.

[10]  Federico Silla,et al.  Addressing Manufacturing Challenges with Cost-Efficient Fault Tolerant Routing , 2010, 2010 Fourth ACM/IEEE International Symposium on Networks-on-Chip.

[11]  Nur A. Touba,et al.  Reliable Network-on-Chip Using a Low Cost Unequal Error Protection Code , 2007, 22nd IEEE International Symposium on Defect and Fault-Tolerance in VLSI Systems (DFT 2007).

[12]  Li-Shiuan Peh,et al.  ARIADNE: Agnostic Reconfiguration in a Disconnected Network Environment , 2011, 2011 International Conference on Parallel Architectures and Compilation Techniques.

[13]  Robert E. Lyons,et al.  The Use of Triple-Modular Redundancy to Improve Computer Reliability , 1962, IBM J. Res. Dev..

[14]  Luca Benini,et al.  Networks on chips - technology and tools , 2006, The Morgan Kaufmann series in systems on silicon.

[15]  Naresh R. Shanbhag,et al.  Coding for systern-on-chip networks: a unified framework , 2004, Proceedings. 41st Design Automation Conference, 2004..

[16]  Narayanan Vijaykrishnan,et al.  Optimizing power and performance for reliable on-chip networks , 2010, 2010 15th Asia and South Pacific Design Automation Conference (ASP-DAC).

[17]  Henry Hoffmann,et al.  On-Chip Interconnection Architecture of the Tile Processor , 2007, IEEE Micro.

[18]  Michele Favalli,et al.  Exploiting Network-on-Chip structural redundancy for a cooperative and scalable built-in self-test architecture , 2011, 2011 Design, Automation & Test in Europe.

[19]  Miltos D. Grammatikakis,et al.  Design of Cost-Efficient Interconnect Processing Units , 2008 .

[20]  Doug A. Edwards,et al.  Adaptive stochastic routing in fault-tolerant on-chip networks , 2009, 2009 3rd ACM/IEEE International Symposium on Networks-on-Chip.

[21]  Scott A. Mahlke,et al.  BulletProof: a defect-tolerant CMP switch architecture , 2006, The Twelfth International Symposium on High-Performance Computer Architecture, 2006..

[22]  Valentin Puente,et al.  Immunet: a cheap and robust fault-tolerant packet routing mechanism , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[23]  Mary Jane Irwin,et al.  Reliability Aware Performance and Power Optimization in DVFS-Based On-Chip Networks , 2010 .

[24]  Manfred Glesner,et al.  Deadlock-free routing and component placement for irregular mesh-based networks-on-chip , 2005, ICCAD-2005. IEEE/ACM International Conference on Computer-Aided Design, 2005..

[25]  Luca Benini,et al.  Analysis of error recovery schemes for networks on chips , 2005, IEEE Design & Test of Computers.

[26]  Paul Ampadu,et al.  Error control integration scheme for reliable NoC , 2010, Proceedings of 2010 IEEE International Symposium on Circuits and Systems.