Provenance for System Troubleshooting

System administrators use a variety of techniques to track down and repair (or avoid) problems that occur in the systems under their purview. Analyzing log files, cross-correlating events on different machines, establishing liveness and performance monitors, and automating configuration procedures are just a few of the approaches used to stave off entropy. These efforts are often stymied by hidden dependencies between components in a system (e.g., processes, pipes, and files). In this paper we argue that system-level provenance (metadata that records the history of files, pipes, processes, and other system-level objects) can help expose these dependencies. This gives system administrators a more complete picture of component interactions and thus eases the task of troubleshooting.
