Using Reflection for Incorporating Fault-Tolerance Techniques into Distributed Applications

As part of the Legion metacomputing project, we have developed the Reflective Graph & Event (RGE) model, a reflective model for incorporating functionality into applications. In this paper we apply the RGE model to the problem of making applications more robust to failure. RGE encourages system developers to express fault-tolerance algorithms as transformations on the data structures that represent computations (messages and methods), which enables the construction of generic and reusable fault-tolerance components. We illustrate the expressive power of RGE by encapsulating the following fault-tolerance techniques in RGE components: two-phase commit distributed checkpointing, passive replication, pessimistic method logging, and forward recovery.
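To make the idea of a fault-tolerance component that operates on reified method invocations more concrete, here is a minimal sketch of pessimistic method logging written as an event handler. It is not the RGE or Legion API: the names `EventBus`, `MethodInvocation`, `PessimisticLogger`, and the `"method_received"` event are all hypothetical, and an in-memory list stands in for stable storage.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class MethodInvocation:
    """A method call reified as data (hypothetical stand-in for a message)."""
    name: str
    args: Tuple[Any, ...] = ()


Handler = Callable[[MethodInvocation], None]


@dataclass
class EventBus:
    """Hypothetical dispatcher: handlers for an event run in registration order."""
    handlers: Dict[str, List[Handler]] = field(default_factory=dict)

    def register(self, event: str, handler: Handler) -> None:
        self.handlers.setdefault(event, []).append(handler)

    def announce(self, event: str, invocation: MethodInvocation) -> None:
        for handler in self.handlers.get(event, []):
            handler(invocation)


class Counter:
    """Toy application object whose state we want to recover."""
    def __init__(self) -> None:
        self.value = 0

    def add(self, n: int) -> None:
        self.value += n


class PessimisticLogger:
    """Records every invocation before it is executed, so it can be replayed."""
    def __init__(self) -> None:
        self.log: List[MethodInvocation] = []  # assumption: stands in for stable storage

    def on_receive(self, invocation: MethodInvocation) -> None:
        self.log.append(invocation)

    def replay(self, target: Any) -> None:
        # Re-apply the logged invocations, in order, to a fresh object.
        for invocation in self.log:
            getattr(target, invocation.name)(*invocation.args)


def make_executor(target: Any) -> Handler:
    """Build the handler that actually runs the method on the target object."""
    def execute(invocation: MethodInvocation) -> None:
        getattr(target, invocation.name)(*invocation.args)
    return execute


def run_demo() -> Tuple[int, int, int]:
    bus = EventBus()
    logger = PessimisticLogger()
    counter = Counter()
    # The logger is registered first, so logging precedes execution.
    bus.register("method_received", logger.on_receive)
    bus.register("method_received", make_executor(counter))

    bus.announce("method_received", MethodInvocation("add", (2,)))
    bus.announce("method_received", MethodInvocation("add", (5,)))

    # Simulate a failure: rebuild the state on a fresh object from the log.
    recovered = Counter()
    logger.replay(recovered)
    return counter.value, recovered.value, len(logger.log)
```

In this sketch the `Counter` class contains no logging code; the behavior comes entirely from which handlers are registered and in what order, which is the kind of separation that makes such a component reusable across applications.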
