Distributed Sensor Network-on-Chip for Performance Optimization of Soft-Error-Tolerant Multiprocessor System-on-Chip

As transistor density continues to increase with advances in nanotechnology, reliability problems caused by increasingly frequent soft errors are becoming critical for next-generation multiprocessor systems. In this paper, we present a systematic approach to the soft-error problem in multiprocessor systems-on-chip (MPSoCs) that also optimizes system performance. To guarantee system correctness, we propose a collaborative hardware-software approach to protect the processors from soft errors: tiny hardware sensors embedded in the processor cores detect soft errors, and software-based rollback scheduling mechanisms perform error recovery. This effectively reduces the protection costs of hardware duplication and software redundancy. To optimize system performance, a distributed control system is built on top of the on-chip communication network and collaboratively manages the entire chip during application execution. Building on cluster-based task migration techniques, we propose an efficient runtime task remapping and rescheduling algorithm that further mitigates the overheads induced by soft-error protection and minimizes the total performance degradation. The distributed control strategy makes the system more adaptable to next-generation hardware and software of larger scale. Extensive performance evaluations using SystemC-based cycle-accurate simulations on a set of real-world applications show that our approach achieves, on average, a 49% performance improvement and a 79.6% reduction in energy consumption compared with related state-of-the-art techniques, and hardware synthesis results show that it introduces only a 2.9% chip area overhead.
