Specifying graceful degradation in distributed systems

Distributed programs must often display graceful degradation, reacting adaptively to changes in the environment. Under ideal circumstances, the program’s behavior satisfies a set of application-dependent constraints. In the presence of failures, timing anomalies, or synchronization conflicts, however, certain constraints may become difficult or impossible to satisfy, and the application designer may choose to relax them as long as the resulting behavior is sufficiently “close” to the preferred behavior. This paper describes the relaxation lattice method, a new approach to specifying graceful degradation for a large class of highly concurrent, fault-tolerant distributed programs. A relaxation lattice is a lattice of specifications parameterized by a set of constraints, where the stronger the set of constraints, the more restrictive the specification. While a program is able to satisfy its strongest set of constraints, it satisfies its preferred specification, but if changes to the environment force it to satisfy a weaker set, then it will permit additional “weakly consistent” computations that are undesired but tolerated. The use of relaxation lattices is illustrated by specifications for programs that tolerate (1) faults, such as site crashes and network partitions, (2) timing anomalies, such as attempting to read a value “too soon” after it was written, and (3) synchronization conflicts, such as choosing the oldest “unlocked” item from a queue.
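The idea of a specification parameterized by a constraint set can be illustrated with a small sketch. The example below is not taken from the paper: the queue, the constraint names `no_skip` and `no_dup`, and the functions `accepts` and `lattice` are all hypothetical, and items are assumed to be distinct. It only shows the ordering property described above, namely that a stronger constraint set accepts fewer histories.

```python
from itertools import combinations

# Hypothetical constraints for a queue (illustrative names, not from the paper):
#   no_skip: a dequeue must return the oldest pending item
#   no_dup:  an item may be dequeued at most once
CONSTRAINTS = frozenset({"no_skip", "no_dup"})


def accepts(history, constraints):
    """Return True if `history` is permitted under `constraints`.

    `history` is a list of ("enq", item) or ("deq", item) pairs.
    Items are assumed to be distinct.
    """
    pending = []  # enqueued and not yet dequeued, oldest first
    removed = []  # already dequeued at least once
    for op, item in history:
        if op == "enq":
            pending.append(item)
        elif op == "deq":
            if item in pending:
                # Relaxing no_skip tolerates out-of-order dequeues.
                if "no_skip" in constraints and pending[0] != item:
                    return False
                pending.remove(item)
                removed.append(item)
            elif item in removed:
                # Relaxing no_dup tolerates returning an item twice.
                if "no_dup" in constraints:
                    return False
            else:
                # An item that was never enqueued is rejected at every level.
                return False
        else:
            raise ValueError(f"unknown operation: {op}")
    return True


def lattice(constraints):
    """Enumerate every subset of `constraints`, weakest (empty) first."""
    names = sorted(constraints)
    return [
        frozenset(subset)
        for size in range(len(names) + 1)
        for subset in combinations(names, size)
    ]


fifo = [("enq", "a"), ("enq", "b"), ("deq", "a"), ("deq", "b")]
skipped = [("enq", "a"), ("enq", "b"), ("deq", "b"), ("deq", "a")]
duplicated = [("enq", "a"), ("deq", "a"), ("deq", "a")]

for level in lattice(CONSTRAINTS):
    print(sorted(level), accepts(fifo, level),
          accepts(skipped, level), accepts(duplicated, level))
```

In this sketch the full constraint set plays the role of the preferred specification, and each subset is a relaxation that admits additional, tolerated histories. Because every constraint can only reject a history, any history accepted under a given set is also accepted under each of its subsets.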
