Fault tolerance in a very large database system: a strawman analysis

A simple model is used to study how fault-tolerance techniques and system design affect system availability. The model is built on a generic multiprocessor architecture that can be configured in different ways to represent different system architectures. The parameters studied include the system architecture, hardware fault-tolerance techniques, mean time to failure of basic components, database size and distribution, and interconnect capacity. A quantitative analysis compares the relative effect of different parameter values, and the results show that this effect on system availability can be very significant. System architecture, use of hardware fault tolerance (particularly mirroring), and data storage methods emerge as very important parameters under the control of a system designer.
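
To illustrate why mirroring can matter so much for availability, the following is a minimal sketch and not the model used in the paper. It assumes independent component failures, the textbook steady-state relation availability = MTTF / (MTTF + MTTR), and a system that is up only when every storage unit is up. The function names and the numeric values (MTTF, MTTR, disk count) are hypothetical and chosen only for illustration.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of one component (assumed relation)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def series(unit_availability: float, n: int) -> float:
    """Availability of n independent units that must all be up."""
    return unit_availability ** n


def mirrored(unit_availability: float) -> float:
    """Availability of a mirrored pair: down only if both copies are down."""
    return 1.0 - (1.0 - unit_availability) ** 2


# Hypothetical parameter values, for illustration only.
disk = availability(mttf_hours=30_000, mttr_hours=24)
n_disks = 100

plain_system = series(disk, n_disks)               # no redundancy
mirrored_system = series(mirrored(disk), n_disks)  # every disk mirrored

print(f"single disk:     {disk:.6f}")
print(f"100 plain disks: {plain_system:.6f}")
print(f"100 mirrored:    {mirrored_system:.6f}")
```

Under these assumptions, the unmirrored configuration loses availability quickly as the number of disks grows, while the mirrored configuration stays close to the availability of a single mirrored pair. That is consistent with the abstract's remark that mirroring is among the most important parameters under a designer's control.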
