Analyzing the effectiveness of fault-management architectures in layered distributed systems

Fault-management infrastructure in a distributed system consists of manager processes and agents that interact in various ways to monitor the status of the application software and hardware. These additional components and interactions become part of the system architecture, and they affect system availability. This paper describes an architecture model called MAMA (Model for Availability Management Architecture), with an architecture definition language MAMA-dl, that covers both the application and management parts of a system, together with a method for analyzing it. The analysis extends the Fault Tolerant Layered Queueing Model to account for the propagation of knowledge of the system state through the management sub-architecture. The model is demonstrated on a problem of placing manager tasks in a system.
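
To illustrate why the management components matter for availability, the following is a minimal sketch and not the MAMA or Fault Tolerant Layered Queueing analysis itself. It assumes a hypothetical system with a primary server, a backup server, one monitoring agent, and one manager, all failing independently, and it assumes failover to the backup succeeds only when both the agent and the manager are up so that knowledge of the primary's failure can reach the point of reconfiguration. All component names and probabilities are invented for the example.

```python
from itertools import product

def availability(up_prob, works):
    """Sum the probability of every up/down state in which works(status) is true.

    up_prob maps a component name to its steady-state probability of being up;
    components are assumed to fail independently (an assumption of this sketch).
    """
    names = list(up_prob)
    total = 0.0
    for state in product([True, False], repeat=len(names)):
        status = dict(zip(names, state))
        p = 1.0
        for name in names:
            p *= up_prob[name] if status[name] else 1.0 - up_prob[name]
        if works(status):
            total += p
    return total

# Hypothetical components and probabilities, chosen only for illustration.
up_prob = {"primary": 0.9, "backup": 0.9, "agent": 0.95, "manager": 0.95}

# Idealized view: failover always succeeds whenever the backup is up.
ideal = availability(up_prob, lambda s: s["primary"] or s["backup"])

# Management-aware view: failover also needs the agent and the manager to be up,
# so that the primary's failure is detected and the switch is commanded.
managed = availability(
    up_prob,
    lambda s: s["primary"] or (s["backup"] and s["agent"] and s["manager"]),
)

print(f"ideal availability:   {ideal:.6f}")
print(f"managed availability: {managed:.6f}")
```

With these assumed numbers the management-aware figure is lower than the idealized one, which is the effect the abstract describes: the management architecture is itself a source of unavailability.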
