Optimal Checkpoint Selection with Dual-Modular Redundancy Hardening

With the continued scaling of semiconductor technology, failure rates are increasing significantly, making reliability an important issue in multiprocessor system-on-chip (MPSoC) design. We propose an optimal checkpoint selection technique with task-duplication hardening to tolerate transient faults. The target application is specified as a task graph, and the schedule and checkpoint placements are determined at design time. The proposed optimal algorithm minimizes the checkpoint overhead under a latency constraint. Experimental results show that the proposed algorithm effectively reduces the minimum end-to-end latency achievable by a fault-tolerant schedule. In addition, it substantially decreases the checkpointing overhead on both uniprocessor and multiprocessor systems compared with a greedy approach and an equidistant algorithm.
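The optimization problem stated above (minimize checkpoint overhead subject to a latency constraint) can be illustrated with a small sketch. This is not the algorithm of the paper: it is an exhaustive search over a deliberately simplified model, and every name in it (`worst_case_latency`, `select_checkpoints`, `exec_times`, `ckpt_costs`, `deadline`) is hypothetical. The assumptions are a linear chain of tasks on one processor, execution times that already include the cost of duplicated execution, a mandatory checkpoint after the last task, and at most one transient fault, which forces re-execution of the segment since the previous checkpoint.

```python
from itertools import combinations


def worst_case_latency(exec_times, ckpt_costs, ckpts):
    """Worst-case latency of a task chain under a single transient fault.

    ckpts is a sorted tuple of task indices after which a checkpoint is
    taken; it must end with the last task index. Assumed model: a fault
    causes one re-execution of the most expensive segment, including the
    checkpoint that closes it.
    """
    fault_free = sum(exec_times) + sum(ckpt_costs[i] for i in ckpts)
    longest = 0
    start = 0
    for i in ckpts:
        segment = sum(exec_times[start:i + 1]) + ckpt_costs[i]
        longest = max(longest, segment)
        start = i + 1
    return fault_free + longest


def select_checkpoints(exec_times, ckpt_costs, deadline):
    """Return (overhead, latency, ckpts) with minimum overhead, or None.

    Exhaustive search over all checkpoint subsets, so it is exact but only
    practical for small chains.
    """
    last = len(exec_times) - 1
    best = None
    for r in range(last + 1):
        for extra in combinations(range(last), r):
            ckpts = extra + (last,)  # final checkpoint is mandatory
            latency = worst_case_latency(exec_times, ckpt_costs, ckpts)
            if latency > deadline:
                continue
            overhead = sum(ckpt_costs[i] for i in ckpts)
            if best is None or overhead < best[0]:
                best = (overhead, latency, ckpts)
    return best


if __name__ == "__main__":
    # Four tasks, unit checkpoint cost, latency constraint of 30 time units.
    print(select_checkpoints([4, 6, 5, 3], [1, 1, 1, 1], 30))
    # → (3, 30, (0, 1, 3))
```

In this toy instance a single final checkpoint gives a worst-case latency of 38, so two additional checkpoints are needed to meet the constraint of 30. A task graph mapped onto several processors, as targeted by the paper, would need a richer latency model than this chain.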
