The inflection point hypothesis: a principled debugging approach for locating the root cause of a failure

The end goal of failure diagnosis is to locate the root cause. Almost all prior root cause localization approaches rely on statistical analysis. This paper takes a different approach, based on the following observation: if we model an execution as a totally ordered sequence of instructions, then the root cause can be identified as the first instruction at which the failure execution deviates from the non-failure execution that shares the longest instruction sequence prefix with it. Root cause analysis is thus transformed into a principled search problem: find the non-failure execution with the longest common prefix. We present Kairux, a tool that does just that. In most cases, it can pinpoint the root cause of a failure in a distributed system in a fully automated way. To construct the non-failure execution with the longest common prefix, Kairux uses tests from the system's rich unit test suite as building blocks. By evaluating Kairux on some of the most complex real-world failures from HBase, HDFS, and ZooKeeper, we show that it can accurately pinpoint each failure's root cause.
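
The definition above can be illustrated with a small sketch. It is not Kairux's implementation: it assumes that executions are already available as flat lists of instruction labels and that a fixed set of non-failure executions is given, whereas the abstract describes constructing that execution from unit tests. The names `common_prefix_len` and `inflection_point`, and the example traces, are hypothetical.

```python
from typing import Hashable, Optional, Sequence, Tuple

# Assumption: an execution is modeled as a totally ordered sequence of
# hashable instruction labels (strings here, purely for illustration).
Trace = Sequence[Hashable]


def common_prefix_len(a: Trace, b: Trace) -> int:
    """Return the number of leading instructions that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def inflection_point(
    failure: Trace, non_failures: Sequence[Trace]
) -> Tuple[int, Optional[Hashable]]:
    """Find where the failure execution first deviates from the
    non-failure execution that shares the longest prefix with it.

    Returns (index, instruction). The instruction is None when the whole
    failure trace is a prefix of some non-failure trace.
    """
    if not non_failures:
        raise ValueError("need at least one non-failure execution")
    # Length of the longest prefix shared with any non-failure execution.
    best = max(common_prefix_len(failure, t) for t in non_failures)
    instr = failure[best] if best < len(failure) else None
    return best, instr


# Hypothetical traces, invented for this example.
failure = ["open", "read", "check", "skip_flush", "crash"]
passing = [
    ["open", "read", "close"],
    ["open", "read", "check", "flush", "close"],
]
index, instr = inflection_point(failure, passing)
print(index, instr)
```

In this toy input the second passing trace shares the longest prefix with the failure trace, so the reported deviation is the fourth instruction of the failure trace. The hard part in practice is the search for that non-failure execution, which this sketch sidesteps by taking the candidates as given.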
