On-chip sensor networks for soft-error tolerant real-time multiprocessor systems-on-chip

As transistor density continues to increase with the advent of nanotechnology, reliability issues raised by the more frequent appearance of soft errors are becoming critical for the design of future embedded multiprocessor systems. State-of-the-art soft error protection techniques for multiprocessor systems incur either high chip cost and area overhead or high performance degradation and energy consumption, and so fail to meet the increasing requirements for high performance and dependability. In this article, we present a systematic approach, Sensor Networks-on-Chip (SENoC), to collaboratively and efficiently manage on-chip applications and overcome reliability threats to Multiprocessor Systems-on-Chip (MPSoC). A hardware-software collaborative approach is proposed to solve soft error problems: a hardware-based on-chip sensor network is built for soft error detection, and a software-based recovery mechanism is applied for soft error correction. A two-step scheduling scheme is presented for reliable application and chip management, combining an offline static optimization stage that maximizes application performance with an online lightweight dynamic adjustment stage that handles runtime variations and exceptions. This strategy introduces only trivial overhead on hardware design and much lower overhead on software control and execution, and hence performance degradation and energy consumption are greatly reduced. We build a cycle-accurate simulator using SystemC, and verify the effectiveness of our technique by comparing its performance with that of related techniques on several real-world applications.
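
To make the two-step scheduling idea concrete, the following Python sketch illustrates one possible reading of it. It is not the SENoC algorithm itself: the greedy list scheduler, the re-execution recovery policy, and all names (`Task`, `static_schedule`, `adjust_online`) are assumptions introduced here for illustration. The offline stage maps a task graph onto cores; the online stage keeps that mapping and order, re-executes the task flagged by a sensor, and delays only the tasks that depend on it or share its core.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    wcet: int                                  # execution time in cycles, assumed known offline
    deps: list = field(default_factory=list)   # names of predecessor tasks


def static_schedule(tasks, num_cores):
    """Offline stage (assumed policy): greedy list scheduling.

    `tasks` must be listed in topological order. Returns a plan of
    (task name, core, start, end) tuples.
    """
    core_free = [0] * num_cores
    finish = {}
    plan = []
    for t in tasks:
        ready = max((finish[d] for d in t.deps), default=0)
        # Pick the core on which this task can start earliest.
        core = min(range(num_cores), key=lambda c: max(core_free[c], ready))
        start = max(core_free[core], ready)
        end = start + t.wcet
        core_free[core] = end
        finish[t.name] = end
        plan.append((t.name, core, start, end))
    return plan


def adjust_online(plan, tasks, faulty):
    """Online stage (assumed policy): re-execute `faulty` once.

    The static mapping and ordering are kept; only start times of
    affected tasks are pushed back.
    """
    wcet = {t.name: t.wcet for t in tasks}
    deps = {t.name: t.deps for t in tasks}
    core_free = {}
    finish = {}
    new_plan = []
    for name, core, start, _ in plan:
        ready = max((finish[d] for d in deps[name]), default=0)
        start = max(start, ready, core_free.get(core, 0))
        runs = 2 if name == faulty else 1      # original run plus one re-execution
        end = start + runs * wcet[name]
        core_free[core] = end
        finish[name] = end
        new_plan.append((name, core, start, end))
    return new_plan


tasks = [
    Task("A", 2),
    Task("B", 3, ["A"]),
    Task("C", 2, ["A"]),
    Task("D", 1, ["B", "C"]),
]
plan = static_schedule(tasks, num_cores=2)
recovered = adjust_online(plan, tasks, faulty="C")
```

In this toy example, a soft error in task `C` delays only `C` and its successor `D`; tasks `A` and `B` keep their statically planned slots, which is the kind of lightweight adjustment the online stage is meant to provide.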
