Yemanja-a layered event correlation engine for multi-domain server farms

Yemanja is a model-based event correlation engine for multi-layer fault diagnosis. It targets complex propagating fault scenarios, and can smoothly correlate low-level network events with high-level application performance alerts related to quality of service violations. Entity-models that represent devices or abstract components encapsulate entity behavior. Distantly associated entities are not explicitly aware of each other, and communicate through event propagation chains. Yemanja's state-based engine supports generic scenario definitions, prioritization of alternate solutions, integrated problem-state and device testing, and simultaneous analysis of overlapping problems. The system of correlation rules was developed based on device, layer, and dependency analysis, and reveals the layered structure of computer networks. The primary objectives of this research include the development of reusable, configuration independent, correlation scenarios; adaptability and the extensibility of the engine to match the constantly changing topology of a multi-domain server farm; and the development of a concise specification language that is relatively simple yet powerful.

[1]  Germán S. Goldszmidt,et al.  Scaling Internet services by dynamic allocation of connections , 1999, Integrated Network Management VI. Distributed Management for the Networked Millennium. Proceedings of the Sixth IFIP/IEEE International Symposium on Integrated Network Management. (Cat. No.99EX302).

[2]  Martin Paterok,et al.  Event Correlation in Heterogeneous Networks Using the OSI Management Framework , 1993, Integrated Network Management.

[3]  Donald D. Chamberlin,et al.  A Complete Guide to DB2 Universal Database , 1998 .

[4]  D. Zager,et al.  Value-oriented network management , 2000, NOMS 2000. 2000 IEEE/IFIP Network Operations and Management Symposium 'The Networked Planet: Management Beyond 2000' (Cat. No.00CB37074).

[5]  Morris Sloman,et al.  GEM: a generalized event monitoring language for distributed systems , 1997, Distributed Syst. Eng..

[6]  B. Dang,et al.  Interconnections, second edition: bridges, routers, switches, and internetworking protocols [Bookshelf] , 2000, IEEE Software.

[7]  Keith McCloghrie,et al.  Definitions of Managed Objects for Bridges , 1991, RFC.

[8]  G. Jakobson,et al.  Alarm correlation , 1993, IEEE Network.

[9]  Radia Perlman,et al.  Interconnections: Bridges, Routers, Switches, and Internetworking Protocols , 1999 .

[10]  Steven Waldbusser Remote Network Monitoring Management Information Base , 1991, RFC.

[11]  Benny Rochwerger,et al.  Oceano-SLA based management of a computing utility , 2001, 2001 IEEE/IFIP International Symposium on Integrated Network Management Proceedings. Integrated Network Management VII. Integrated Management Strategies for the New Millennium (Cat. No.01EX470).

[12]  Mats Carlsson,et al.  SICStus Prolog User''s Manual , 1993 .

[13]  Stefan Kiitker A Modeling Framework for Integrated Distributed Systems Fault Management , 1996 .

[14]  Jeffrey D. Case,et al.  Simple Network Management Protocol (SNMP) , 1990, RFC.

[15]  Mischa Schwartz,et al.  Schemes for fault identification in communication networks , 1995, TNET.

[16]  Rajeev Gopal,et al.  Layered model for supporting fault isolation and recovery , 2000, NOMS 2000. 2000 IEEE/IFIP Network Operations and Management Symposium 'The Networked Planet: Management Beyond 2000' (Cat. No.00CB37074).

[17]  Yossi A. Nygate,et al.  Event correlation using rule and object based techniques , 1995, Integrated Network Management.

[18]  D. Ohsie,et al.  High speed and robust event correlation , 1996, IEEE Commun. Mag..

[19]  Guangtian Liu,et al.  Composite events for network event correlation , 1999, Integrated Network Management VI. Distributed Management for the Networked Millennium. Proceedings of the Sixth IFIP/IEEE International Symposium on Integrated Network Management. (Cat. No.99EX302).

[20]  Andrew Hiles Service Level Agreements: Measuring Cost and Quality in Service Relationships , 1993 .

[21]  Marshall T. Rose,et al.  Management Information Base for network management of TCP/IP-based internets , 1990, RFC.

[22]  Peng Wu,et al.  Alarm correlation engine (ACE) , 1998, NOMS 98 1998 IEEE Network Operations and Management Symposium.

[23]  Seraphin B. Calo,et al.  Towards a practical alarm correlation system , 1995, Integrated Network Management.

[24]  Ramesh Viswanathan,et al.  A conceptual framework for network management event correlation and filtering systems , 1999, Integrated Network Management VI. Distributed Management for the Networked Millennium. Proceedings of the Sixth IFIP/IEEE International Symposium on Integrated Network Management. (Cat. No.99EX302).