A compositional approach to monitoring distributed systems

This paper proposes a specification-based monitoring approach for the automatic run-time detection of software errors and failures in distributed systems. The specification is assumed to be expressed in a formalism based on communicating finite state machines. The monitor observes the external I/O and partial state information of the target distributed system and uses them to interpret the specification. The approach is compositional: it achieves global monitoring by combining component-level monitoring. The core of the paper describes the architecture and operation of the monitor. The monitor includes several independent mechanisms, each tailored to detecting specific kinds of errors or failures. Their operation is described in detail using illustrative examples. Techniques for dealing with nondeterminism and concurrency in monitoring a distributed system are also discussed with respect to the considered model and specification. A case study describing the application of the prototype monitor to an embedded system is presented.
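To make the idea concrete, the following is a minimal sketch of compositional, specification-based monitoring. It is not the architecture described in the paper. It assumes each component's specification is a finite state machine whose transitions map a (state, input) pair to a set of allowed (output, next state) pairs, and it handles nondeterminism by tracking the set of states the component may currently occupy. All names (`ComponentMonitor`, `GlobalMonitor`, `observe`, `sync_state`, `verdict`) and the toy server/client specifications are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Assumed spec encoding: (state, input) -> set of (expected_output, next_state).
# More than one entry for the same key models a nondeterministic specification.
Transitions = dict[tuple[str, str], set[tuple[str, str]]]


@dataclass
class ComponentMonitor:
    """Tracks one component against its FSM specification (illustrative only)."""
    name: str
    transitions: Transitions
    possible: set[str]      # states the component may currently be in
    failed: bool = False

    def observe(self, inp: str, out: str) -> bool:
        """Advance on one observed I/O pair; False if the spec allows no such step."""
        nxt = {
            dst
            for state in self.possible
            for (exp_out, dst) in self.transitions.get((state, inp), set())
            if exp_out == out
        }
        if not nxt:
            self.failed = True
            return False
        self.possible = nxt
        return True

    def sync_state(self, observed_state: str) -> bool:
        """Use partial state information to prune the candidate state set."""
        if observed_state not in self.possible:
            self.failed = True
            return False
        self.possible = {observed_state}
        return True


@dataclass
class GlobalMonitor:
    """Combines component-level monitors into a system-level verdict."""
    components: dict[str, ComponentMonitor]

    def observe(self, component: str, inp: str, out: str) -> bool:
        return self.components[component].observe(inp, out)

    def verdict(self) -> list[str]:
        """Names of components whose observed behavior violated their spec."""
        return [name for name, m in self.components.items() if m.failed]


# Hypothetical specifications for a two-component system.
server_spec: Transitions = {
    ("idle", "req"): {("ack", "ready"), ("ack", "busy")},  # nondeterministic
    ("ready", "data"): {("ok", "idle")},
    ("busy", "data"): {("nak", "busy")},
}
client_spec: Transitions = {("idle", "send"): {("sent", "idle")}}

server = ComponentMonitor("server", server_spec, {"idle"})
client = ComponentMonitor("client", client_spec, {"idle"})
monitor = GlobalMonitor({"server": server, "client": client})

monitor.observe("server", "req", "ack")    # server may now be ready or busy
monitor.observe("server", "data", "ok")    # only the ready branch explains "ok"
monitor.observe("client", "send", "lost")  # no transition yields "lost"
print(monitor.verdict())                   # ['client']
```

Keeping a set of candidate states, rather than a single state, lets the sketch tolerate a nondeterministic specification: a failure is flagged only when no candidate state can explain the observed output. Partial state information, when available, narrows the set through `sync_state`.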
