Fault-Tolerant Computing

This chapter provides an overview of fault-tolerant computing. Fault-tolerant computing can be defined as the process by which a computing system continues to perform its specified tasks correctly in the presence of faults, with the goal of improving the dependability of the system. Principles of fault-tolerant computing as well as various fault-tolerant architectures are discussed. The chapter concludes by observing trends in fault-tolerant computing.

Keywords: fault tolerance; reliability; availability; coverage; dependability; redundancy
