Minimizing Mean-Time-to-Recover in a Recursively Restartable Software System

This paper presents ideas on how to structure software systems for high availability by considering the MTTR/MTTF characteristics of components in addition to traditional criteria such as functionality or state sharing. Recursive restartability (RR), a recently proposed technique for achieving high availability, exploits partial restarts at various levels within complex software infrastructures to recover from transient failures and to rejuvenate software components. Here we refine the original proposal and apply the RR philosophy to Mercury, a COTS-based university satellite ground station that has been in operation for over two years. We develop four techniques for transforming component group boundaries so that time-to-repair is reduced, thereby increasing system availability. We also extend RR by defining the notions of an oracle, a restart tree, and a restart group, and we show how to reason about system properties in terms of restart groups. From our experience applying RR to Mercury, we draw design guidelines and lessons for the systematic application of recursive restartability to other software systems amenable to it.
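
The link between a shorter time-to-repair and higher availability can be made concrete with the standard steady-state availability formula. This formula is not stated in the abstract and is assumed here as the usual definition; the numbers in the worked example are purely illustrative and are not measurements from Mercury.

```latex
% Steady-state availability in terms of mean-time-to-failure (MTTF)
% and mean-time-to-recover (MTTR). Standard definition, assumed here.
\[
  A = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}}
\]

% Illustrative values only: MTTF held fixed at 1000 h, MTTR halved from 1 h to 0.5 h.
\[
  A_{1} = \frac{1000}{1000 + 1} \approx 0.99900,
  \qquad
  A_{2} = \frac{1000}{1000 + 0.5} \approx 0.99950
\]
```

With MTTF unchanged, halving MTTR roughly halves the unavailability $1 - A$, which is why restructuring restart boundaries to shorten recovery can raise availability even when components fail no less often.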
