Improving Availability of Multicore Real-Time Systems Suffering Both Permanent and Transient Faults

CMOS scaling has greatly increased concerns about both lifetime reliability, which is threatened by permanent faults, and soft-error reliability, which is threatened by transient faults. Most existing work focuses on only one of the two, yet techniques that improve one type of reliability may degrade the other. A few efforts consider both types together but quantify them with two different metrics. For many systems, however, the user's goal is to maximize system availability by improving the mean time to failure (MTTF), regardless of whether a failure is caused by permanent or transient faults. Addressing this goal requires a uniform metric that captures the effect of both types of faults. This paper introduces a novel analytical expression for calculating the MTTF due to transient faults. Using this expression and an existing method for evaluating system MTTF, we tackle the problem of maximizing availability for multicore real-time systems subject to both permanent and transient faults, and we propose a framework to solve it. Experimental results on a hardware board and simulation results with synthetic tasks show that our scheme significantly improves system MTTF (and hence availability) compared with existing techniques.
