Using correlated surprise to infer shared influence

We propose a method for identifying the sources of problems in complex production systems where, due to the prohibitive costs of instrumentation, the data available for analysis may be noisy or incomplete. In particular, we may not have complete knowledge of all components and their interactions. We define influences as a class of component interactions that includes direct communication and resource contention. Our method infers the influences among components in a system by looking for pairs of components with time-correlated anomalous behavior. We summarize the strength and directionality of shared influences using a Structure-of-Influence Graph (SIG). This paper explains how to construct a SIG and use it to isolate system misbehavior, and presents both simulations and in-depth case studies with two autonomous vehicles and a 9024-node production supercomputer.

[1]  Sailesh Chutani,et al.  On the distributed fault diagnosis of computer networks , 1995, Proceedings IEEE Symposium on Computers and Communications.

[2]  Jon Stearley,et al.  What Supercomputers Say: A Study of Five System Logs , 2007, 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07).

[3]  Sheng Ma,et al.  Real-time problem determination in distributed systems using active probing , 2004, 2004 IEEE/IFIP Network Operations and Management Symposium (IEEE Cat. No.04CH37507).

[4]  Alexander Aiken,et al.  Alert Detection in System Logs , 2008, 2008 Eighth IEEE International Conference on Data Mining.

[5]  Paramvir Bahl,et al.  Towards highly reliable enterprise network services via inference of multi-level dependencies , 2007, SIGCOMM '07.

[6]  Peter C. Bates,et al.  Debugging heterogeneous distributed systems using event-based models of behavior , 1988, PADD '88.

[7]  Eric A. Brewer,et al.  Pinpoint: problem determination in large, dynamic Internet services , 2002, Proceedings International Conference on Dependable Systems and Networks.

[8]  Richard Mortier,et al.  Using Magpie for Request Extraction and Workload Modelling , 2004, OSDI.

[9]  Armando Fox,et al.  Capturing, indexing, clustering, and retrieving system history , 2005, SOSP '05.

[10]  David A. Patterson,et al.  Path-Based Failure and Evolution Management , 2004, NSDI.

[11]  Marcos K. Aguilera,et al.  WAP5: black-box performance debugging for wide-area systems , 2006, WWW '06.

[12]  Albert G. Greenberg,et al.  IP fault localization via risk modeling , 2005, NSDI.

[13]  Sebastian Thrun,et al.  Stanley: The robot that won the DARPA Grand Challenge , 2006, J. Field Robotics.

[14]  Sebastian Thrun,et al.  Junior: The Stanford entry in the Urban Challenge , 2008, J. Field Robotics.

[15]  Herbert A. Sturges,et al.  The Choice of a Class Interval , 1926 .

[16]  Dirk Haehnel,et al.  Junior: The Stanford entry in the Urban Challenge , 2008 .

[17]  Sebastian Thrun,et al.  A Personal Account of the Development of Stanley, the Robot That Won the DARPA Grand Challenge , 2006, AI Mag..

[18]  Amin Vahdat,et al.  Pip: Detecting the Unexpected in Distributed Systems , 2006, NSDI.

[19]  Sheng Ma,et al.  Optimizing Probe Selection for Fault Localization , 2001, DSOM.

[20]  Aaron B. Brown,et al.  An active approach to characterizing dynamic dependencies for problem determination in a distributed environment , 2001, 2001 IEEE/IFIP International Symposium on Integrated Network Management Proceedings. Integrated Network Management VII. Integrated Management Strategies for the New Millennium (Cat. No.01EX470).

[21]  Srikanth Kandula,et al.  Shrink: a tool for failure diagnosis in IP networks , 2005, MineNet '05.

[22]  Marcos K. Aguilera,et al.  Performance debugging for distributed systems of black boxes , 2003, SOSP '03.