Pinpoint: problem determination in large, dynamic Internet services

Traditional problem determination techniques rely on static dependency models that are difficult to generate accurately in today's large, distributed, and dynamic application environments, such as e-commerce systems. We present a dynamic analysis methodology that automates problem determination in these environments by (1) coarsely tagging large numbers of real client requests as they travel through the system and (2) using data mining techniques to correlate the believed failures and successes of these requests to determine which components are most likely to be at fault. To validate our methodology, we have implemented Pinpoint, a framework for root cause analysis on the J2EE platform that requires no knowledge of the application components. Pinpoint consists of three parts: a communications layer that traces client requests, a failure detector that uses traffic sniffing and middleware instrumentation, and a data analysis engine. We evaluate Pinpoint by injecting faults into various application components and show that Pinpoint identifies the faulty components with high accuracy and produces few false positives.
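The correlation step described above can be illustrated with a minimal sketch. It is not Pinpoint's data analysis engine: the abstract does not specify the data mining technique, so this example assumes a simple stand-in that scores each component by the Jaccard similarity between the set of requests that touched it and the set of requests believed to have failed. The function name `rank_components`, the trace tuple layout, and the component names are all hypothetical.

```python
from collections import defaultdict


def rank_components(traces):
    """Rank components by how closely their usage matches observed failures.

    traces: iterable of (request_id, components_used, failed) tuples, where
    components_used is a set of component names and failed is a bool.
    This layout is an assumption made for illustration.
    """
    used_by = defaultdict(set)  # component -> ids of requests that touched it
    failed = set()              # ids of requests believed to have failed

    for request_id, components, did_fail in traces:
        for component in components:
            used_by[component].add(request_id)
        if did_fail:
            failed.add(request_id)

    scores = {}
    for component, requests in used_by.items():
        # Jaccard similarity: |A & B| / |A | B|. The union is never empty
        # here because every recorded component has at least one request.
        union = requests | failed
        scores[component] = len(requests & failed) / len(union)

    # Highest score first: the most suspicious component leads the list.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical traces: every request that used "cart" failed.
traces = [
    ("r1", {"frontend", "cart", "db"}, True),
    ("r2", {"frontend", "catalog", "db"}, False),
    ("r3", {"frontend", "cart"}, True),
    ("r4", {"frontend", "catalog"}, False),
]
ranking = rank_components(traces)
```

In this toy data the `cart` component ranks first because it appears in exactly the failed requests, while `frontend`, which is used by every request, scores lower. The score depends only on the tagged traces and the believed success or failure of each request, so no static dependency model or knowledge of the application components is needed.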
