Fault tolerance for highly available internet services: concepts, approaches, and issues

Fault-tolerant frameworks provide highly available services by means of fault detection and fault recovery mechanisms. These frameworks must satisfy constraints on the strength of the fault model, performance, and resource consumption. One motivation for this work is the observation that current fault-tolerant frameworks are not always suited to existing Internet services. In fact, most of the proposed frameworks are neither transport-level-aware nor session-level-aware, although the services concerned range from established services such as HTTP and FTP to more recent ones such as multimodal conferencing and voice over IP. In this work we give a comprehensive overview of fault tolerance concepts, approaches, and issues. We show how the redundancy of application servers can be exploited to ensure efficient failover of Internet services when the server responsible for processing goes down.
