Two-State Checkpointing for Energy-Efficient Fault Tolerance in Hard Real-Time Systems

Checkpointing with rollback recovery is a well-established technique for tolerating transient faults. However, it incurs significant time and energy overheads, which are wasted in fault-free execution and may make the technique infeasible in hard real-time systems. This paper presents a low-overhead two-state checkpointing (TsCp) scheme for fault-tolerant hard real-time systems. TsCp distinguishes between fault-free and faulty execution states and uses a different type of checkpoint interval for each. The first type is nonuniform intervals, used while no fault has occurred. These intervals are determined by postponing checkpoint insertions in the fault-free state, with the aim of reducing the number of checkpoints inserted. The second type is uniform intervals, used from the time the first fault occurs. They are chosen to minimize execution time in the faulty state, which leaves more time available for energy management in the fault-free state. Experimental evaluation on an embedded processor (LEON3) with an emerging nonvolatile memory technology (ReRAM) shows that TsCp significantly reduces the number of checkpoints (by 62% on average) compared with previous work, while preserving fault tolerance. This reduces execution time by 14% and energy consumption by 13%. Furthermore, combining TsCp with dynamic voltage scaling (DVS) achieves energy savings of up to 26% (21% on average) compared with state-of-the-art techniques.
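The two-state idea can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract does not give the formulas that determine the intervals, so the interval values below are illustrative inputs. The function name `two_state_schedule` and its parameters are hypothetical, and the sketch assumes a single fault that is detected immediately, with checkpoint cost ignored.

```python
from itertools import accumulate


def two_state_schedule(total_work, nonuniform_intervals, uniform_interval, fault_at=None):
    """Return (checkpoint positions taken, total work executed).

    Assumptions (not from the paper): work is measured in abstract integer
    units, at most one fault occurs, it is detected at the instant it occurs,
    and taking a checkpoint costs nothing.
    """
    # Fault-free state: checkpoints sit at the cumulative sums of the
    # nonuniform intervals (no checkpoint is needed at task completion).
    planned = [p for p in accumulate(nonuniform_intervals) if p < total_work]
    if fault_at is None:
        return planned, total_work

    # Checkpoints already taken when the fault strikes.
    taken = [p for p in planned if p <= fault_at]
    last_saved = taken[-1] if taken else 0

    # Rollback: work done since the last checkpoint must be re-executed.
    re_executed = fault_at - last_saved

    # Faulty state: switch to uniform intervals from the rollback point.
    pos = last_saved + uniform_interval
    while pos < total_work:
        taken.append(pos)
        pos += uniform_interval

    return taken, total_work + re_executed


if __name__ == "__main__":
    # Fault-free run: only the postponed, nonuniform checkpoints are taken.
    print(two_state_schedule(100, [50, 30, 20], 10))
    # A fault at work unit 60 rolls back to 50 and switches to uniform intervals.
    print(two_state_schedule(100, [50, 30, 20], 10, fault_at=60))
```

In this toy model the fault-free run takes only the two postponed checkpoints, whereas the run with a fault switches to denser, evenly spaced checkpoints after rolling back.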
