Self-healing in large-scale systems: parallel and distributed diagnostic architectures

Automated real-time problem diagnosis is a key feature of a self-healing system. However, the rapidly growing size and complexity of modern distributed systems challenge traditional centralized diagnostic approaches and call for parallel and distributed architectures. Dividing the system into subsystems, each controlled by a separate diagnostic engine, is an obvious choice. On its own, however, this is not enough: a communication architecture is also needed so that diagnostic engines can exchange information about the components they share and thereby obtain a better diagnosis. In this paper, we discuss a distributed belief propagation approach to diagnosis and provide a scalable parallel and distributed communication architecture that supports efficient message exchange among diagnostic engines.
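To make the idea of engines exchanging information about a common component concrete, the following is a minimal sketch and not the architecture proposed in the paper. It assumes a single binary component (OK or faulty) shared by two engines, each holding local evidence as a likelihood vector, and it uses the standard sum-product rule for that simple case: an engine's belief is the prior multiplied by its own likelihood and by every likelihood message it has received. The names `DiagnosticEngine`, `exchange`, and `shared-router`, and all numeric values, are hypothetical.

```python
from dataclasses import dataclass, field


def normalize(weights):
    """Scale non-negative weights so that they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


@dataclass
class DiagnosticEngine:
    """Hypothetical engine that monitors one subsystem."""
    name: str
    # component -> [P(local observations | OK), P(local observations | faulty)]
    local_likelihood: dict
    # component -> {sender name -> likelihood message}
    inbox: dict = field(default_factory=dict)

    def message_for(self, component):
        # With one shared variable, the outgoing message is simply the
        # engine's local likelihood for that component.
        return list(self.local_likelihood[component])

    def receive(self, component, sender, message):
        self.inbox.setdefault(component, {})[sender] = message

    def belief(self, component, prior):
        # Belief is proportional to prior * own likelihood * received messages.
        own = self.local_likelihood[component]
        b = [prior[s] * own[s] for s in (0, 1)]
        for msg in self.inbox.get(component, {}).values():
            b = [b[s] * msg[s] for s in (0, 1)]
        return normalize(b)


def exchange(engines, component):
    """Every engine sends its message about `component` to every other engine."""
    for sender in engines:
        for receiver in engines:
            if receiver is not sender:
                receiver.receive(component, sender.name,
                                 sender.message_for(component))


# Assumed numbers: state 0 = OK, state 1 = faulty.
prior = [0.9, 0.1]
engine_a = DiagnosticEngine("A", {"shared-router": [0.2, 0.8]})
engine_b = DiagnosticEngine("B", {"shared-router": [0.3, 0.7]})

faulty_before = engine_a.belief("shared-router", prior)[1]
exchange([engine_a, engine_b], "shared-router")
faulty_after_a = engine_a.belief("shared-router", prior)[1]
faulty_after_b = engine_b.belief("shared-router", prior)[1]

print(f"A alone: {faulty_before:.3f}")
print(f"A after exchange: {faulty_after_a:.3f}")
print(f"B after exchange: {faulty_after_b:.3f}")
```

In this toy setting both engines reach the same posterior after a single exchange, because the only coupling between them is the one shared component. With many shared components and cyclic dependencies, messages would have to be passed iteratively and the result would in general be approximate.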
