Availability requirement for a fault management server in high-availability communication systems

In this paper, we investigate the availability requirement for the fault management server in high-availability communication systems. We find that the fault management server itself does not need 99.999% availability to guarantee 99.999% system availability, provided that the fail-safe ratio (the probability that a failure of the fault management server does not bring the system down) and the fault coverage ratio (the probability that a failure in the system is detected and recovered by the fault management server) are sufficiently high. The availability of the fault management server, the fail-safe ratio, and the fault coverage ratio can be traded off against one another to optimize system availability. We also propose a cost-effective design for the fault management server.
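The tradeoff described above can be illustrated with a small numerical sketch. This is a toy model written for illustration only, not the availability model derived in the paper. It assumes that a failure of the fault management server takes the system down unless it is fail-safe, and that a system fault is handled only when the server is up and covers the fault. The function name and the parameters `u_covered` and `u_uncovered` (the unavailability contributed by handled and unhandled faults) are hypothetical.

```python
FIVE_NINES_BUDGET = 1e-5  # unavailability allowed by a 99.999% target


def system_unavailability(a_fms, fail_safe, coverage, u_covered, u_uncovered):
    """Toy estimate of system unavailability (illustrative assumption only).

    a_fms       -- availability of the fault management server
    fail_safe   -- probability that a server failure does not bring the system down
    coverage    -- probability that a system fault is detected and recovered
    u_covered   -- hypothetical unavailability contributed by handled faults
    u_uncovered -- hypothetical unavailability contributed by unhandled faults
    """
    # Direct term: the server is down and its failure is not fail-safe.
    direct = (1.0 - a_fms) * (1.0 - fail_safe)
    # A fault is handled only if the server is up and the fault is covered.
    handled = a_fms * coverage
    managed = handled * u_covered + (1.0 - handled) * u_uncovered
    return direct + managed


if __name__ == "__main__":
    # A "three nines" server combined with a high fail-safe ratio and coverage.
    u = system_unavailability(0.999, 0.999, 0.99, 1e-6, 1e-4)
    print("meets five nines:", u < FIVE_NINES_BUDGET)
    # The same server with a low fail-safe ratio.
    u_low = system_unavailability(0.999, 0.9, 0.99, 1e-6, 1e-4)
    print("meets five nines:", u_low < FIVE_NINES_BUDGET)
```

Under these assumed numbers, the direct term is the product of two small quantities, which is why a server well below 99.999% availability can still fit inside the system budget when the fail-safe ratio is high.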
