A Hardware-Software Collaborated Method for Soft-Error Tolerant MPSoC

Multiprocessor systems-on-chip (MPSoCs) are attractive platforms for increasingly complex embedded applications, because integrating a system or a complex subsystem on a single chip provides better performance and energy efficiency and a lower cost per function. As feature sizes and supply voltages continue to decrease, MPSoCs are becoming more susceptible to soft errors. However, traditional soft-error tolerant methods introduce large area, power, and performance overheads. This paper presents a low-overhead hardware-software collaborated method, called SENoC, that dynamically mitigates soft errors on MPSoCs using an on-chip sensor network. We developed a low-cost on-chip sensor network to collaboratively monitor and detect soft errors, and implemented software-based mechanisms to guarantee correct task execution. To maximize the performance of soft-error tolerant MPSoCs, we propose a hybrid scheduling scheme that effectively manages applications and resources under uncertainty. We studied the new method on MPSoCs of different scales and tested it using typical embedded applications under different cosmic ray flux conditions. Experimental results show that, compared with traditional methods, SENoC requires substantially lower protection overheads to achieve the same level of soft-error tolerance. For instance, soft-error tolerant MPSoCs using SENoC achieve on average 114.1% better performance than a recent traditional method, and SENoC introduces only 0.42% area overhead on a 256-core MPSoC.
