Fault Management in Distributed Systems: A Policy-Driven Approach

Managing the availability and performance of a distributed system involves monitoring the system's behavior, identifying problems, and correcting them. Each of these tasks requires expertise, such as an understanding of the mechanics of the underlying system components. As the size and complexity of these systems increase, and as the number of distributed applications running on them grows, managing their availability and performance becomes more difficult. Little research has focused on embedding systems management expertise into a management application for a distributed system. In this paper we describe a rule-based management application for a commercially available distributed computing environment. The application monitors the distributed system, detects performance and availability problems related to system services, and generates actions to correct them.
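To make the monitor, detect, and correct cycle concrete, the following is a minimal sketch of a rule-based evaluator. It is an illustration only and not the application described in this paper: the `Rule` type, the `evaluate` function, the metric names, the thresholds, and the action names are all hypothetical and chosen purely for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    """Hypothetical management rule: a condition over metrics plus a corrective action."""
    name: str
    condition: Callable[[Dict[str, float]], bool]
    action: str


def evaluate(rules: List[Rule], metrics: Dict[str, float]) -> List[str]:
    """Return the corrective action of every rule whose condition holds."""
    return [rule.action for rule in rules if rule.condition(metrics)]


# Illustrative rules; metric names and thresholds are invented for this sketch.
RULES = [
    Rule("service_down", lambda m: m.get("heartbeat_age_s", 0.0) > 30.0, "restart_service"),
    Rule("slow_response", lambda m: m.get("response_ms", 0.0) > 500.0, "add_replica"),
]

if __name__ == "__main__":
    # One monitoring sample: the heartbeat is stale, the response time is acceptable.
    sample = {"heartbeat_age_s": 45.0, "response_ms": 120.0}
    print(evaluate(RULES, sample))
```

Keeping the rules as data, separate from the evaluation loop, is one way management expertise could be added or revised without changing the monitoring code itself.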
