Fast Extraction of Adaptive Change Point Based Patterns for Problem Resolution in Enterprise Systems

Enterprise middleware systems typically consist of a large cluster of machines with stringent performance requirements. When a performance problem occurs in such an environment, it is therefore critical that the health monitoring software identify the root cause with minimal delay. A common technique for isolating root causes is rule definition, in which combinations of events that cause particular problems are specified in advance. However, such predefined rules (or problem signatures) tend to be inflexible, and their definition depends heavily on domain experts. In this paper we present a method that automatically generates change-point-based problem signatures from administrator feedback, thereby removing the dependence on domain experts. The signatures generated by our method are flexible: they do not require exact matches to trigger, and they adapt as more information becomes available. Unlike traditional data mining techniques, which need a large number of problem instances to extract meaningful patterns, our method learns problem signatures from only a few fault instances. We demonstrate the efficacy of our approach by learning problem signatures for five common problems that occur in enterprise systems and by reliably recognizing these problems after a small number of learning instances.
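To make the idea of a flexible, adaptive signature concrete, the following is a minimal illustrative sketch and not the method of the paper. It assumes that each fault instance has already been reduced to the set of metrics in which a change point was detected, and that a signature is simply a per-metric frequency over administrator-confirmed instances. The class `ProblemSignature`, its methods, the metric names, and the 0.7 threshold are all hypothetical choices made for illustration.

```python
from collections import Counter


class ProblemSignature:
    """Hypothetical signature: how often each metric showed a change
    point across administrator-confirmed instances of one problem."""

    def __init__(self, name):
        self.name = name
        self.instances = 0       # number of confirmed fault instances seen
        self.counts = Counter()  # metric -> number of instances with a change point

    def learn(self, changed_metrics):
        """Fold in one confirmed fault instance (adaptive update)."""
        self.instances += 1
        self.counts.update(set(changed_metrics))

    def weights(self):
        """Per-metric weight: fraction of instances in which it changed."""
        return {m: c / self.instances for m, c in self.counts.items()}

    def score(self, changed_metrics):
        """Weighted fraction of the signature present in the observation (0..1)."""
        w = self.weights()
        total = sum(w.values())
        if total == 0:
            return 0.0
        observed = set(changed_metrics)
        return sum(v for m, v in w.items() if m in observed) / total

    def matches(self, changed_metrics, threshold=0.7):
        """Inexact match: trigger when enough of the signature is observed."""
        return self.score(changed_metrics) >= threshold


# Usage with made-up metric names: two confirmed instances, then a new observation.
sig = ProblemSignature("db_connection_pool_exhaustion")
sig.learn({"db_wait_time", "response_time", "thread_count"})
sig.learn({"db_wait_time", "response_time"})
print(sig.matches({"db_wait_time", "response_time"}))  # partial overlap still triggers
print(sig.matches({"thread_count"}))                   # weak overlap does not
```

Under these assumptions, a metric that changes in only some confirmed instances carries less weight, so its absence does not prevent a match, and each new confirmed instance shifts the weights without retraining from scratch.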
