Paper information - The inflection point hypothesis: a principled debugging approach for locating the root cause of a failure

The end goal of failure diagnosis is to locate the root cause. Almost all prior root cause localization approaches rely on statistical analysis. This paper takes a different approach, based on the following observation: if we model an execution as a totally ordered sequence of instructions, then the root cause can be identified as the first instruction at which the failure execution deviates from the non-failure execution that shares the longest instruction sequence prefix with it. Root cause analysis thus becomes a principled search problem: find the non-failure execution with the longest common prefix. We present Kairux, a tool that performs this search. In most cases it can pinpoint the root cause of a failure in a distributed system fully automatically. Kairux uses tests from the system's rich unit test suite as building blocks to construct the non-failure execution that has the longest common prefix with the failure execution, and thereby locates the root cause. By evaluating Kairux on some of the most complex real-world failures from HBase, HDFS, and ZooKeeper, we show that Kairux can accurately pinpoint each failure's respective root cause.
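
The core observation can be illustrated with a minimal sketch. This is not Kairux's implementation: it assumes executions are already available as flat lists of instruction labels, and it only selects from a fixed set of candidate non-failure traces, whereas the abstract describes constructing the non-failure execution from unit tests. The names `common_prefix_len` and `inflection_point` and the toy traces are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def common_prefix_len(a: Sequence[str], b: Sequence[str]) -> int:
    """Return the number of leading instructions that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def inflection_point(
    failure: Sequence[str],
    non_failures: List[Sequence[str]],
) -> Tuple[int, int, Optional[str]]:
    """Pick the non-failure trace with the longest common prefix.

    Returns (index of that trace, prefix length, first deviating
    instruction of the failure trace). The instruction is None when the
    whole failure trace is a prefix of the chosen non-failure trace.
    """
    if not non_failures:
        raise ValueError("need at least one non-failure execution")
    # On ties, max() keeps the first candidate with the longest prefix.
    best_idx, best_len = max(
        ((i, common_prefix_len(failure, t)) for i, t in enumerate(non_failures)),
        key=lambda pair: pair[1],
    )
    deviating = failure[best_len] if best_len < len(failure) else None
    return best_idx, best_len, deviating


# Toy traces (hypothetical instruction labels, for illustration only).
failure_trace = ["open", "read", "parse", "retry", "crash"]
passing_traces = [
    ["open", "read", "close"],
    ["open", "read", "parse", "commit", "close"],
]

print(inflection_point(failure_trace, passing_traces))
# prints (1, 3, 'retry')
```

In this toy case the second passing trace shares three instructions with the failure trace, so the first deviating instruction, `retry`, is reported as the candidate root cause.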
